Chapter 1

Introduction 


In  distributed  database  systems,  logical  data  items  are  often  replicated  in  order  to  improve 
availability, reliability and performance. Whenever replication is used, a replication
algorithm is required in order to ensure that the replication is transparent to the user programs.
In  understanding  replication  algorithms,  it  is  convenient  to  think  of  each  logical  data  item 
as  being  implemented  by  a  collection  of  data  managers  (DMs)  and  transaction  managers 
(TMs).  The  DMs  retain  state  information,  and  the  collective  state  of  the  DMs  defines  the 
current  state  of  the  logical  data  item.  The  user  programs  invoke  TMs  in  order  to  read  or 
write  the  logical  data  item;  the  TMs  accomplish  this  by  physically  accessing  some  subset 
of  the  DMs. 

One  of  the  most  well-known  replication  algorithms  is  Gifford’s  algorithm  [Gi],  which  we 
call Quorum Consensus. The ideas of this method, which is based on Thomas [T], underlie many
of  the  more  recent  and  sophisticated  replication  techniques  (e.g.,  [ASC,AT,ES,He]).  In 
Gifford’s  algorithm,  each  DM  is  assigned  a  certain  number  of  votes  and  keeps  as  part  of 
its state a data value with an associated version number. Each logical data item x has an
associated configuration that consists of a pair of integers called read-quorum and write-quorum.
If v is the total number of votes assigned to DMs for x, then the configuration is
constrained so that read-quorum + write-quorum > v. To read x, a TM collects the version
numbers and values from enough DMs so that it has a read-quorum of votes; then it returns
the value associated with the highest version number. To write x, a TM first collects the
version numbers from enough DMs so that it has a read-quorum of votes; then it writes its
value with a higher version number to a collection of DMs with a write-quorum of votes.
This  method  generalizes  both  the  read-one/write-all  and  the  read-majority/write-majority 
algorithms. 
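The voting scheme just described can be sketched in a few lines. This is only an illustration under assumptions that are not part of the original text: the names DM, legal, collect, read and write are hypothetical, failures and concurrency are ignored, and a TM is assumed to contact DMs in the order in which they are listed.

```python
from dataclasses import dataclass


@dataclass
class DM:
    """One data manager: its vote weight and a versioned copy of the value."""
    votes: int
    version: int = 0
    value: object = None


def legal(dms, read_quorum, write_quorum):
    """Gifford's constraint: read-quorum + write-quorum > total votes v."""
    return read_quorum + write_quorum > sum(dm.votes for dm in dms)


def collect(dms, needed):
    """Contact DMs in the given order until `needed` votes are gathered."""
    chosen, votes = [], 0
    for dm in dms:
        if votes >= needed:
            break
        chosen.append(dm)
        votes += dm.votes
    if votes < needed:
        raise RuntimeError("not enough votes reachable")
    return chosen


def read(dms, read_quorum):
    """Return the value carrying the highest version number in a read-quorum."""
    return max(collect(dms, read_quorum), key=lambda dm: dm.version).value


def write(dms, read_quorum, write_quorum, value):
    """Learn the highest version from a read-quorum, then install a higher one."""
    highest = max(dm.version for dm in collect(dms, read_quorum))
    for dm in collect(dms, write_quorum):
        dm.version, dm.value = highest + 1, value


# Three DMs with one vote each; read-majority/write-majority is (2, 2).
a, b, c = DM(1), DM(1), DM(1)
write([a, b, c], 2, 2, "first")   # reaches a and b only
result = read([c, b, a], 2)       # contacts c and b; b holds the new version
```

With one vote per DM, the pair (1, 3) gives read-one/write-all and the pair (2, 2) gives read-majority/write-majority. The pair (1, 2) violates the constraint, since a reader could then miss the most recent write.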

Here,  we  adopt  a  slightly  more  general  configuration  strategy,  which  is  justified  by 
Barbara and Garcia-Molina in [BaGa]: a configuration consists of a set of read-quorums
and  a  set  of  write-quorums.  Each  quorum  is  a  set  of  DM  names,  and  every  read-quorum 
must  have  a  non-empty  intersection  with  every  write-quorum.  To  read  a  data  item,  a  TM 
accesses  all  the  DMs  in  some  read-quorum  and  chooses  the  value  with  the  highest  version 
number.  To  write  a  data  item,  a  TM  first  discovers  the  highest  version  number  written  so 
far by accessing all the DMs in some read-quorum; then the TM increments that version
number by one and writes the new value and version number to all the DMs in some
write-quorum.
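The set-based configuration strategy can be illustrated as follows. This is a minimal sketch, not the algorithm as formalized later in the thesis: the function names and the four DM names a, b, c, d are hypothetical, each replica is stored as a (version, value) pair in a dictionary, and no failures or concurrent accesses are modelled.

```python
def is_legal(read_quorums, write_quorums):
    """Every read-quorum must share at least one DM with every write-quorum."""
    return all(r & w for r in read_quorums for w in write_quorums)


def logical_read(replicas, read_quorum):
    """Return the value with the highest version number among the DMs accessed."""
    version, value = max((replicas[dm] for dm in read_quorum), key=lambda pair: pair[0])
    return value


def logical_write(replicas, read_quorum, write_quorum, value):
    """Find the highest version in a read-quorum, add one, write to a write-quorum."""
    highest = max(replicas[dm][0] for dm in read_quorum)
    for dm in write_quorum:
        replicas[dm] = (highest + 1, value)


# A configuration over four DMs; each quorum is a set of DM names.
read_quorums = [frozenset("ab"), frozenset("cd")]
write_quorums = [frozenset("ac"), frozenset("bd")]
replicas = {dm: (0, None) for dm in "abcd"}

logical_write(replicas, read_quorums[0], write_quorums[0], "v1")
seen_after_first = logical_read(replicas, read_quorums[1])
logical_write(replicas, read_quorums[1], write_quorums[1], "v2")
seen_after_second = logical_read(replicas, read_quorums[0])
```

Because every read-quorum intersects every write-quorum, each read above sees the latest write even though the reader and the writer use different quorums.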

In  this  thesis,  we  generalize  Gifford’s  algorithm  in  three  fundamental  ways.  First,  we 
incorporate  the  concept  of  transaction  nesting  into  the  algorithm.  Transaction  nesting  is 
useful  in  its  own  right  (for  instance,  as  the  basis  of  the  distributed  programming  language 
ARGUS  [LHJLSW,LiS,Mo,We]).  In  addition,  it  turns  out  that  nested  transactions  provide 
a  useful  way  of  understanding  replication  algorithms  even  if  user  transactions  are  not  nested 
(as in Gifford [Gi]). This is because the TMs themselves can be regarded as subtransactions
of  the  user  transactions.  Once  one  sees  how  to  understand  the  algorithm  in  this  way,  it  is 
very  natural  to  generalize  the  algorithm  to  allow  nesting  of  user  transactions  as  well.  Second, 
we  extend  the  algorithm  to  accommodate  transaction  failures  (aborts).  Thus,  for  example, 
an  operation  to  access  a  logical  data  item  can  complete  even  if  some  of  its  associated  DM 
accesses  abort.  Finally,  we  provide  a  fully-developed  version  of  the  mechanism  outlined  by 
Gifford  for  changing  the  read-  and  write-quorums  dynamically.  This  capability,  known  as 
reconfiguration,  generally  is  used  for  coping  with  site  or  link  failures.  We  obtain  a  single 
algorithm  that  integrates  all  three  generalizations. 

We  present  our  algorithm  using  the  new  framework  of  Lynch  and  Merritt  [LM]  for 
modeling nested transaction concurrency control and recovery. For clarity, we present two
versions of the algorithm. In the first version, we assume that the configuration for each
data  item  is  fixed  and  is  known  in  advance  to  all  the  TMs  that  access  that  data  item.  The 
second  version  includes  the  reconfiguration  mechanism.  The  descriptions  are  clear,  simple, 
and  unambiguous.  A  complete  correctness  proof  is  also  included;  it  is  short,  natural,  and 
intuitive,  yet  completely  rigorous. 

An  important  reason  for  the  simplicity  of  the  proof  is  the  fact  that  we  are  able  to 
separate  the  treatment  of  replication  entirely  from  the  treatment  of  concurrency  control 
and  recovery.  That  is,  we  are  able  to  consider  the  replication  issues  solely  in  the  context  of 
serial  systems.  We  prove  that  a  system  which  includes  the  new  replication  algorithm  and 
which  is  serial  at  the  level  of  the  individual  data  copies  “simulates”  (in  a  strong  sense)  a 
system  which  is  serial  at  the  level  of  the  logical  data  items.  In  particular,  it  “looks  the  same” 
to  the  user  transactions.  Since  both  systems  involved  in  this  simulation  are  serial  systems, 
the  simulation  proof  is  very  simple,  and  is  based  on  standard  assertional  techniques. 

Of  course,  systems  which  are  truly  serial  at  the  level  of  the  data  copies  are  of  little 
practical  interest.  However,  previous  work  on  nested  transaction  concurrency  control  and 
recovery  algorithms  [Mo,R,LM,FLMW]  has  produced  several  interesting  algorithms  which 
guarantee  that  a  system  appears  to  be  serial,  as  far  as  the  transactions  can  tell.  Combining 
any  of  these  algorithms  (at  the  copy  level)  with  the  new  replication  algorithm  yields  a 
combined  algorithm  which  appears  to  be  non-replicated  and  serial  (at  the  logical  data  item 
level),  as  far  as  the  user  transactions  can  tell. 

In fact, our results show that the replication algorithm can be combined with any
algorithm  which  guarantees  “serializability”  at  the  copy  level,  to  yield  a  system  which  is 
serializable  at  the  logical  item  level.  Thus,  our  work  formalizes  a  frequently  stated  informal 
claim  that  “quorum  consensus  works  with  any  correct  concurrency  control  algorithm.  As 
long  as  the  algorithm  produces  serializable  executions,  quorum  consensus  will  ensure  that 
the  effect  is  just  like  an  execution  on  a  single  copy  database”  [BHG]. 

The  presentation  and  proof  techniques  presented  here  work  so  well  that  we  expect  they 
will  be  of  general  use  in  simplifying  the  treatment  of  many  other  data  replication  algorithms 
as well.

Related work, in addition to the papers already mentioned, includes some previous
attempts  at  rigorous  presentation  and  proof  of  replicated  data  algorithms.  Most  notable 
among  these  is  the  presentation  and  proof  given  by  Bernstein,  Hadzilacos,  and  Goodman 
[BHG] of Gifford’s basic algorithm. This work is based on serializability theory, a theory
which  has  made  a  significant  contribution  to  the  understanding  of  concurrency  control. 
This  approach,  however,  does  not  appear  to  generalize  easily  to  the  case  where  nesting 
and  failures  are  allowed.  Also,  Herlihy  [He]  extends  Gifford’s  algorithm  to  accommodate 
abstract  data  types  and  offers  a  correctness  proof.  Again,  nesting  is  not  considered.  This 
thesis  is  part  of  a  larger  effort  to  unify  the  work  in  concurrency  control  and  recovery,  as 
well  as  extend  it  to  permit  nesting  [LM,FLMW,HLMW]. 

The  remainder  of  the  thesis  is  organized  as  follows.  In  Chapter  2,  we  introduce  the 
computation  model.  Then,  in  Chapter  3,  we  describe  the  generalized  version  of  Gifford’s 
algorithm  without  reconfiguration  and  prove  its  correctness.  In  Chapter  4,  we  expand  on 
these  results  to  give  the  description  and  correctness  proof  of  the  complete  algorithm  (i.e., 
including  reconfiguration).  For  both  versions  of  the  algorithm,  we  show  that  the  correctness 
of  interesting  non-serial  replicated  systems  follows  directly  from  these  results.  Chapter  5 
contains a summary of our results and a brief discussion of possible further research.

Chapter 2

The Model

We use the I/O automaton model, due to Lynch-Merritt [LM] and Lynch-Tuttle [LT], as the
formal  foundation  for  our  work.  We  model  components  of  a  system  by  (possibly  infinite- 
state)  nondeterministic  automata  that  have  operation  names  associated  with  their  state 
transitions.  Communication  among  automata  is  described  by  identifying  their  operations. 
We  only  prove  properties  of  finite  behavior,  so  a  simple  special  case  of  the  general  model  is 
sufficient.  Sections  2.1  and  2.2  provide  a  brief  introduction  to  I/O  automata  and  systems 
that  includes  the  definitions  from  [LM]  and  [LT]  that  are  relevant  to  this  work.  Then,  in 
Section  2.3,  we  extend  the  model  with  some  new  definitions  that  are  particularly  useful  for 
modeling  replicated  data  management  algorithms. 

2.1  I/O  Automata  and  Systems 

The  basic  components  of  the  model  are  I/O  automata.  An  I/O  automaton  A  has  components 
states(A), start(A), out(A), in(A), and steps(A). Here, states(A) is a set of states, of which
a subset start(A) is designated as the set of start states. The next two components are
disjoint sets: out(A) is the set of output operations, and in(A) is the set of input operations.
The  union  of  these  two  sets  is  the  set  of  operations  of  the  automaton.  Finally,  steps(A)  is 
the  transition  relation  of  A,  which  is  a  set  of  triples  of  the  form  (s',x,s),  where  s'  and  s 
are  states,  and  x  is  an  operation.  This  triple  means  that  in  state  s',  the  automaton  can 
atomically perform operation x and change to state s. An element of the transition relation
is called a step of A. If (s',x,s) is a step of A, we say that x is enabled in s'.

The  output  operations  are  intended  to  model  the  actions  that  are  triggered  by  the 
automaton  itself,  while  the  input  operations  model  the  actions  that  are  triggered  by  the 
environment  of  the  automaton.  We  require  the  following  condition,  which  says  that  an  I/O 
automaton  must  be  prepared  to  receive  any  input  operation  at  any  time. 

Input Condition: For each input operation x and each state s', there exist a state s and
a step (s', x, s).


An execution of A is a finite alternating sequence s_0, x_1, s_1, x_2, ..., x_n, s_n of states and
operations of A, where s_0 is in start(A) and each subsequence (s_i, x_{i+1}, s_{i+1}) is in steps(A). From
any execution, we can extract the schedule, which is the subsequence of the execution that
contains only the operations (e.g., x_1, x_2, ..., x_n). Because transitions to different states may
have the same operation, different executions may have the same schedule.
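For finite automata, the definitions above can be written down directly. The following is a sketch under assumptions that are not in the original: the Automaton class, its field names and the small lamp example are hypothetical, and the state and operation sets are finite so that they can be enumerated (the model itself allows infinite-state automata).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    states: frozenset
    start: frozenset
    out: frozenset
    inp: frozenset      # `in` is a Python keyword, so the input set is named inp
    steps: frozenset    # triples (s_prime, x, s)


def input_condition_holds(a):
    """Each input operation must be enabled in every state."""
    return all(
        any((s_prime, x, s) in a.steps for s in a.states)
        for x in a.inp
        for s_prime in a.states
    )


def is_execution(a, seq):
    """seq alternates states and operations, starting and ending with a state."""
    if len(seq) % 2 == 0 or seq[0] not in a.start:
        return False
    return all(tuple(seq[i:i + 3]) in a.steps for i in range(0, len(seq) - 2, 2))


def schedule(seq):
    """Keep only the operations of an execution."""
    return list(seq[1::2])


lamp = Automaton(
    states=frozenset({"off", "on"}),
    start=frozenset({"off"}),
    out=frozenset({"glow"}),
    inp=frozenset({"press"}),
    steps=frozenset({("off", "press", "on"), ("on", "press", "off"), ("on", "glow", "on")}),
)
run = ["off", "press", "on", "glow", "on", "press", "off"]
```

Here run is an execution of lamp, and its schedule is the operation sequence press, glow, press.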

If  S  is  any  set  of  schedules  (or  property  of  schedules),  then  automaton  A  is  said  to 
preserve S provided that the following holds. If α = α'x is any schedule of A, where x is
an output operation and α' is in S, then α is in S. That is, A is not the first to violate the
property  described  by  S. 

We model a system as a set of interacting components, each of which is an I/O automaton.
It is convenient and natural to view systems as I/O automata as well. Thus, we define
an  operation  that  composes  a  set  of  I/O  automata  to  yield  a  new  I/O  automaton. 

A set of I/O automata may be composed to create a system S, provided that the sets of
output operations for the automata are disjoint. Thus, every output operation in S will be
triggered by exactly one component. The system S is itself an I/O automaton. A state of the
composed automaton is a tuple of states, one for each component, and the start states are
tuples consisting of start states of the components. The set of operations of S, ops(S), is the
union of the sets of operations of the component automata. The set of output operations of
S, out(S), is likewise the union of the sets of output operations of the component automata.
Finally, the set of input operations of S, in(S), is ops(S) - out(S), the set of operations of S
that are not output operations of S. The output operations of a system are intended to be
exactly  those  that  are  triggered  by  components  of  the  system,  while  the  input  operations 
of  a  system  are  those  that  are  triggered  by  the  system’s  environment. 

The  triple  (s',  x,  s)  is  in  the  transition  relation  of  S  if  and  only  if  for  each  component 
automaton  A,  one  of  the  following  two  conditions  holds.  Either  x  is  an  operation  of  A, 
and  the  projection  of  the  step  onto  A  is  a  step  of  A,  or  else  x  is  not  an  operation  of  A, 
and  the  state  corresponding  to  A  in  tuple  s’  is  identical  to  the  state  corresponding  to  A 
in  tuple  s.  Thus,  each  operation  of  the  composed  automaton  is  an  operation  of  a  subset 
of  the  component  automata.  During  the  performance  of  an  operation  x  of  S,  each  of  the 
components  which  has  operation  x  carries  out  the  operation,  while  the  remainder  stay  in 
the  same  state.  Again,  the  operation  x  is  an  output  operation  of  the  composition  if  it  is  the 
output  operation  of  a  component  —  otherwise,  x  is  an  input  operation  of  the  composition. 

An  execution  of  a  system  is  defined  to  be  an  execution  of  the  composition  of  the  automata 
modeling the individual system components. If α is a sequence of operations of a system S
with component A, then α|A (read “α restricted to A”) is the subsequence of α containing
exactly the operations of A. Clearly, if α is a schedule of S, then α|A is a schedule of A.
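The composition rule and the restriction of a sequence to one component can be sketched as below. The representation is an assumption made for illustration only: each component is a dictionary with hypothetical keys "ops", "out" and "steps", and the sender and receiver automata are invented examples.

```python
def project(sequence, ops):
    """Restrict a sequence of operations to those belonging to one component."""
    return [x for x in sequence if x in ops]


def composable(components):
    """The output sets of the components must be pairwise disjoint."""
    seen = set()
    for c in components:
        if seen & c["out"]:
            return False
        seen |= c["out"]
    return True


def is_system_step(components, s_prime, x, s):
    """Check the composed transition relation on state tuples s_prime and s."""
    if not any(x in c["ops"] for c in components):
        return False
    for c, before, after in zip(components, s_prime, s):
        if x in c["ops"]:
            if (before, x, after) not in c["steps"]:
                return False      # a component that has x must take a step on x
        elif before != after:
            return False          # a component without x must not change state
    return True


sender = {
    "ops": {"send", "ack"},
    "out": {"send"},
    "steps": {("idle", "send", "waiting"), ("waiting", "ack", "idle"), ("idle", "ack", "idle")},
}
receiver = {
    "ops": {"send", "ack", "log"},
    "out": {"ack", "log"},
    "steps": {("empty", "send", "full"), ("full", "send", "full"), ("full", "ack", "empty"),
              ("empty", "log", "empty"), ("full", "log", "full")},
}
system = [sender, receiver]
```

In this example, log is an operation of the receiver only, so the sender's state must stay unchanged across any log step of the composition.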

The  following  lemma,  known  as  the  Composition  Lemma,  expresses  formally  the  notion 
that  an  operation  is  under  the  control  of  the  component  of  which  it  is  an  output. 

Lemma 1 Let α' be a schedule of a system S, and let α = α'x, where x is an output
operation of component A. If α|A is a schedule of A, then α is a schedule of S.

Proof: In [LM]. ■

Let α be a schedule of system S. We say that property P holds after α iff property P
holds for the final state of every execution of S whose schedule is α. We say that property
P holds forever after α iff property P holds for the final state of every execution of S whose
schedule has α as a prefix.

Let A be an automaton whose transition relation is restricted so that if (s',x,s_1) and
(s',x,s_2) are both in steps(A), then s_1 = s_2. If A has a unique initial state, then we say
that A is a state-deterministic automaton. That is, A is deterministic in the sense that its
state is a function of its schedule.
All of the automata that we define explicitly are state-deterministic. For such automata,
we will freely use the words “state s of A after schedule α” to denote the unique state of A
resulting from the execution of A whose schedule is α.

2.2  Nested  Transaction  Systems 

To model nested transaction systems we use a system type, which is a tuple (T, parent, O, V).
T is the set of transaction names organized into a tree by the mapping parent: T → T,
with T_0 as the root. In referring to this tree, we use traditional terminology, such as
child, leaf, least common ancestor (lca), ancestor and descendant. (A transaction is its own
ancestor and descendant.) The leaves of T are called accesses. The set O is a partition of
the set of accesses, where each element (class) of the partition contains the accesses to a
particular object; each element of O denotes its corresponding object. Finally, V is the set
of  values  that  may  be  returned  by  transactions.  The  tree  structure  is  known  in  advance 
by  all  the  components  of  the  system  and  can  be  thought  of  as  a  predefined  naming  scheme 
for  all  possible  transactions  that  might  ever  be  invoked.  In  general,  the  tree  is  an  infinite 
structure,  and  only  some  of  the  transactions  will  take  steps  in  any  given  execution. 
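A small finite instance may make the system type concrete. The sketch below is an assumption-laden illustration: the tree is stored as a hypothetical dictionary from each transaction name to its parent (the root T0 has no entry), the transaction and object names are invented, and the tree is finite, whereas the model allows an infinite one.

```python
def ancestors(parent, t):
    """Return t, parent(t), ..., root. A transaction is its own ancestor."""
    chain = [t]
    while t in parent:
        t = parent[t]
        chain.append(t)
    return chain


def lca(parent, t, u):
    """Least common ancestor of two transaction names."""
    above_t = set(ancestors(parent, t))
    for candidate in ancestors(parent, u):
        if candidate in above_t:
            return candidate
    raise ValueError("names do not belong to the same tree")


def accesses(parent):
    """The leaves of the tree: names that are nobody's parent."""
    return set(parent) - set(parent.values())


def is_partition(classes, universe):
    """Check that the classes are disjoint and together cover the universe."""
    members = [t for cls in classes.values() for t in cls]
    return len(members) == len(set(members)) and set(members) == universe


parent = {"T1": "T0", "T2": "T0", "T1.1": "T1", "T1.2": "T1", "T2.1": "T2"}
objects = {"X": {"T1.1", "T2.1"}, "Y": {"T1.2"}}
```

In this instance T1 and T2 are the children of the root, the three leaves are the accesses, and objects plays the role of the partition O.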

The root transaction T_0 plays a special role in this theory. The root models the environment
of the nested transaction system (the “external world”) from which requests for
transactions originate and to which the results of these transactions are reported. Since
it has no parent, T_0 may neither commit nor abort. The classical transactions of concurrency
control theory (without nesting) appear in our model as the children of T_0. (In other
work on nested transactions, such as Argus, the children of T_0 are often called “top-level”
transactions.) Even in the context of classical theory (with no additional nesting) it is
convenient to introduce the root transaction to model the environment in which the rest of
the transaction system runs, with operations that describe the invocation and return of the
classical transactions. It is natural to reason about T_0 in the same way as about all of the
other transactions.

The  internal  nodes  of  the  tree  model  transactions  whose  function  is  to  create  and  manage 
subtransactions, but not to access data directly. The only transactions which actually access
data are the leaves of the transaction tree, and thus they are called “accesses”. The partition
O simply identifies those transactions which access the same object.

The  systems  we  describe  are  serial  systems.  A  serial  system  is  the  composition  of  a  set  of 
I/O  automata.  This  set  contains  a  transaction  for  each  internal  node  of  the  transaction  tree, 
a basic object for each element of O, and a serial scheduler for the given system type. The
system  primitives  are  the  transaction  automata  and  the  basic  objects;  these  describe  user 
programs  and  data,  respectively.  The  serial  scheduler  controls  communication  between  the 
primitives,  and  thereby  defines  the  allowable  orders  in  which  the  primitives  may  take  steps. 
All  three  types  of  system  components  are  modelled  as  I/O  automata.  These  automata  are 
described below. (If X is a basic object associated with an element X of the partition O,
and T is an access in X, we write T ∈ accesses(X) and say that “T is an access to X”.)

Non-access  Transactions:  Transactions  are  modelled  as  I/O  automata.  In  modeling 
transactions,  we  consider  it  very  important  not  to  constrain  them  unnecessarily;  thus,  we 
do  not  want  to  require  that  they  be  expressible  as  programs  in  any  particular  high-level 
programming  language.  Modeling  the  transactions  as  I/O  automata  allows  us  to  state 
exactly  the  properties  that  are  needed,  without  introducing  unnecessary  restrictions  or 
complicated  semantics. 

A non-access transaction T is modelled as an I/O automaton, with the following operations:

Input operations:  CREATE(T)
                   COMMIT(T',v), where T' ∈ children(T) and v ∈ V
                   ABORT(T'), where T' ∈ children(T)

Output operations: REQUEST-CREATE(T'), where T' ∈ children(T)
                   REQUEST-COMMIT(T,v), where v ∈ V

The CREATE input operation “wakes up” the transaction. The REQUEST-CREATE
output operation is a request by T to create a particular child transaction.¹ The COMMIT
input operation reports to T the successful completion of one of its children, and returns
a value recording the results of that child's execution. The ABORT input operation reports
to T the unsuccessful completion of one of its children, without returning any other
information. We call COMMIT(T',v), for any v, and ABORT(T') return operations for
transaction T'. The REQUEST-COMMIT operation is an announcement by T that it has
finished its work, and includes a value for reporting the results of that work to its parent.

¹ Note that there is no provision for T to pass information to its child in this request. In a programming
language, T might be permitted to pass parameter values to a subtransaction. Although this may be a
convenient descriptive aid, it is not necessary to include it in the underlying formal model. Instead, we
consider transactions that have different input parameters to be different transactions.

It  is  convenient  to  use  two  separate  operations,  REQUEST-CREATE  and  CREATE,  to 
describe  what  takes  place  when  a  subtransaction  is  activated.  The  REQUEST-CREATE 
is  an  operation  of  the  transaction’s  parent,  while  the  actual  CREATE  takes  place  at  the 
subtransaction itself. In actual distributed systems such as Argus [LiS], this separation does
occur, and the distinction will be important in our results and proofs. Similar remarks hold
for the REQUEST-COMMIT and COMMIT operations, which occur at a transaction and
its  parent,  respectively. 

We  leave  the  executions  of  particular  transaction  automata  largely  unspecified;  the 
choice  of  which  children  to  create,  and  what  value  to  return,  will  depend  on  the  particular 
implementation.  However,  it  is  convenient  to  assume  that  schedules  of  transaction  automata 
obey  certain  syntactic  constraints.  Thus,  transaction  automata  are  required  to  preserve 
well-formedness,  as  defined  below. 

We recursively define well-formedness for sequences of operations of a transaction T.
Namely, the empty schedule is well-formed. Also, if α = α'x is a sequence of operations of
T, where x is a single operation, then α is well-formed provided that α' is well-formed, and
the following hold:

• If x is CREATE(T), then

1. there is no CREATE(T) in α'.

• If x is COMMIT(T',v) or ABORT(T') for a child T' of T, then

1. REQUEST-CREATE(T') appears in α' and

2. there is no return operation for T' in α'.

• If x is REQUEST-CREATE(T') for a child T' of T, then

1. there is no REQUEST-CREATE(T') in α',

2. there is no REQUEST-COMMIT for T in α', and

3. CREATE(T) appears in α'.

• If x is a REQUEST-COMMIT for T, then

1. there is no REQUEST-COMMIT for T in α' and

2. CREATE(T) appears in α'.

These  restrictions  are  very  basic;  they  simply  say  that  a  transaction  is  created  at  most 
once,  does  not  receive  repeated  (or  conflicting)  notification  of  the  fates  of  its  children,  and 
does  not  receive  information  about  the  fate  of  any  child  whose  creation  it  has  not  requested. 
Also,  a  transaction  performs  output  operations  neither  before  it  is  created  nor  after  it  has 
requested  to  commit,  and  a  transaction  does  not  request  the  creation  of  any  given  child 
more  than  once. 
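The recursive definition of well-formedness for a transaction can be turned into a single left-to-right pass. The sketch below rests on assumptions that are not in the original: operations are encoded as hypothetical tuples such as ("COMMIT", "T1", 5), and the input is assumed to contain only operations of the one transaction T being checked, so the code does not verify that names really are T or children of T.

```python
def well_formed(ops):
    """Check a sequence of operations of one transaction T, prefix by prefix."""
    created = commit_requested = False
    requested, returned = set(), set()      # children requested / children returned
    for op in ops:
        kind, name = op[0], op[1]
        if kind == "CREATE":
            if created:
                return False
            created = True
        elif kind in ("COMMIT", "ABORT"):
            if name not in requested or name in returned:
                return False
            returned.add(name)
        elif kind == "REQUEST-CREATE":
            if name in requested or commit_requested or not created:
                return False
            requested.add(name)
        elif kind == "REQUEST-COMMIT":
            if commit_requested or not created:
                return False
            commit_requested = True
        else:
            return False
    return True


normal = [("CREATE", "T"), ("REQUEST-CREATE", "T1"), ("COMMIT", "T1", 5),
          ("REQUEST-COMMIT", "T", 5)]
early_commit = [("CREATE", "T"), ("REQUEST-CREATE", "T1"),
                ("REQUEST-COMMIT", "T", 0), ("ABORT", "T1")]
not_created = [("REQUEST-CREATE", "T1")]
double_return = [("CREATE", "T"), ("REQUEST-CREATE", "T1"),
                 ("COMMIT", "T1", 5), ("ABORT", "T1")]
late_output = [("CREATE", "T"), ("REQUEST-COMMIT", "T", 0), ("REQUEST-CREATE", "T1")]
```

The sequence early_commit is well-formed: T requests to commit before learning the fate of T1, and the definition does not forbid this.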

Except for these minimal conditions, there are no restrictions on allowable transaction
behavior. For example, the model allows a transaction to request to commit without discovering
the fate of all subtransactions whose creation it has requested. Also, a transaction
can request creation of new subtransactions at any time, without regard to its state of
knowledge about subtransactions whose creation it has previously requested. Particular
programming languages may choose to impose additional restrictions on transaction behavior.
(An example is Argus, which suspends activity in transactions until subtransactions
complete.) However, our results do not require such restrictions.

Basic Objects: Since access transactions model abstract operations on shared data objects,
we associate a single I/O automaton with each object, rather than one with each
access. The operations of a basic object automaton X are the invocation and return operations
of its access transactions:

Input operations:  CREATE(T), for T ∈ accesses(X)

Output operations: REQUEST-COMMIT(T,v), for T ∈ accesses(X) and v ∈ V

Let α be a sequence of operations of basic object X. Then an access T to X is said to
be pending in α provided that there is a CREATE(T) but no REQUEST-COMMIT for T
in α.

It is convenient to require that schedules of basic objects satisfy certain syntactic conditions.
Thus, each basic object is required to preserve well-formedness, which is defined
recursively  as  follows. 

The empty schedule is well-formed. If α = α'x is a sequence of operations of basic object
X, where x is a single operation, then α is well-formed provided that α' is well-formed, and
the following conditions hold.

• If x is CREATE(T) then

1. there is no CREATE(T) in α', and

2. there are no pending accesses in α'.

• If x is a REQUEST-COMMIT for T then

1. there is no REQUEST-COMMIT for T in α', and

2. CREATE(T) appears in α'.

That  is,  the  schedules  of  basic  objects  are  restricted  to  consist  of  alternating  CREATE 
and  REQUEST-COMMIT  operations,  starting  with  a  CREATE,  and  with  each  (CREATE, 
REQUEST-COMMIT) pair having the same access transaction, where each access transaction
has at most one CREATE.
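The well-formedness condition for basic objects admits an equally short check. As before, this is a sketch under stated assumptions: operations are hypothetical tuples ("CREATE", T) and ("REQUEST-COMMIT", T, v), and the sequence is assumed to contain only operations of a single basic object X.

```python
def well_formed_object(ops):
    """Check a sequence of operations of one basic object, prefix by prefix."""
    created, finished = set(), set()
    for op in ops:
        kind, access = op[0], op[1]
        if kind == "CREATE":
            # No earlier CREATE for this access, and no access still pending.
            if access in created or created - finished:
                return False
            created.add(access)
        elif kind == "REQUEST-COMMIT":
            # No earlier REQUEST-COMMIT for this access, and it must have been created.
            if access in finished or access not in created:
                return False
            finished.add(access)
        else:
            return False
    return True


alternating = [("CREATE", "A"), ("REQUEST-COMMIT", "A", 1),
               ("CREATE", "B"), ("REQUEST-COMMIT", "B", 2)]
overlapping = [("CREATE", "A"), ("CREATE", "B")]
uninvoked = [("REQUEST-COMMIT", "A", 1)]
repeated = [("CREATE", "A"), ("REQUEST-COMMIT", "A", 1), ("CREATE", "A")]
```

The set created - finished is exactly the set of pending accesses, so the first test enforces that accesses to one object never overlap.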

Serial Scheduler: The serial scheduler is a fully specified automaton. The serial scheduler
can choose nondeterministically to abort any transaction T after parent(T) has issued
a REQUEST-CREATE(T) operation, as long as T has not actually been created. Thus,
the “semantics” of an ABORT(T) operation are that T was never created. Furthermore, a
transaction can only be created if (1) it has not already been created, (2) its parent has
requested its creation, and (3) all of its created siblings have returned. In other words,
the scheduler runs transactions according to a depth-first traversal of the transaction tree.
Finally, the scheduler cannot commit a transaction until all of the transaction's children
whose creation has been requested have returned. The formal definition of the serial
scheduler, adapted from [LM,FLMW], is as follows.

The state of the serial scheduler has components create-requested, created, commit-requested,
committed, aborted, and returned. Commit-requested is a set of (transaction, value)
pairs, and the rest are sets of transaction names. Initially, create-requested
= {T_0}, and the other sets are empty.

The  steps  of  the  transition  relation  for  each  automaton  we  define  are  exactly  those  triples 
(s', x, s) satisfying the pre- and postconditions listed, where x is the indicated operation. If
a component of s is not mentioned in the postcondition, then it is taken to be the same in
s  as  in  s'. 

Input operations:  REQUEST-CREATE(T)
                   REQUEST-COMMIT(T,v)

Output operations: CREATE(T)
                   COMMIT(T,v)
                   ABORT(T)

• REQUEST-CREATE(T)

Postcondition: create-requested(s) = create-requested(s') ∪ {T}

• REQUEST-COMMIT(T,v)

Postcondition: commit-requested(s) = commit-requested(s') ∪ {(T,v)}

• CREATE(T)

Precondition:  T ∈ create-requested(s') - (created(s') ∪ aborted(s'))
               siblings(T) ∩ created(s') ⊆ returned(s')
Postcondition: created(s) = created(s') ∪ {T}

• COMMIT(T,v)

Precondition:  (T,v) ∈ commit-requested(s')
               T ∉ returned(s')
               children(T) ∩ create-requested(s') ⊆ returned(s')
Postcondition: committed(s) = committed(s') ∪ {(T,v)}
               returned(s) = returned(s') ∪ {T}

• ABORT(T)

Precondition:  T ∈ create-requested(s') - (created(s') ∪ aborted(s'))
               siblings(T) ∩ created(s') ⊆ returned(s')
Postcondition: aborted(s) = aborted(s') ∪ {T}
               returned(s) = returned(s') ∪ {T}
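The pre- and postconditions of the serial scheduler translate almost line for line into executable form. The class below is a sketch, not the formal automaton: the class and method names are hypothetical, the transaction tree is assumed finite and is given as a dictionary from child to parent with root name "T0", and committed stores (transaction, value) pairs exactly as the COMMIT postcondition is written in the listing.

```python
class SerialScheduler:
    def __init__(self, parent, root="T0"):
        self.parent = parent                    # child name -> parent name
        self.create_requested = {root}
        self.created = set()
        self.commit_requested = set()           # (transaction, value) pairs
        self.committed = set()                  # pairs, following the listing
        self.aborted = set()
        self.returned = set()

    def siblings(self, t):
        p = self.parent.get(t)
        return {u for u, q in self.parent.items() if q == p and u != t}

    def children(self, t):
        return {u for u, q in self.parent.items() if q == t}

    # Input operations: no precondition, only a postcondition.
    def request_create(self, t):
        self.create_requested.add(t)

    def request_commit(self, t, v):
        self.commit_requested.add((t, v))

    # CREATE and ABORT share the same precondition.
    def can_create(self, t):
        return (t in self.create_requested - (self.created | self.aborted)
                and self.siblings(t) & self.created <= self.returned)

    can_abort = can_create

    def can_commit(self, t, v):
        return ((t, v) in self.commit_requested
                and t not in self.returned
                and self.children(t) & self.create_requested <= self.returned)

    # Output operations: check the precondition, then apply the postcondition.
    def create(self, t):
        if not self.can_create(t):
            raise ValueError("CREATE is not enabled")
        self.created.add(t)

    def commit(self, t, v):
        if not self.can_commit(t, v):
            raise ValueError("COMMIT is not enabled")
        self.committed.add((t, v))
        self.returned.add(t)

    def abort(self, t):
        if not self.can_abort(t):
            raise ValueError("ABORT is not enabled")
        self.aborted.add(t)
        self.returned.add(t)


sched = SerialScheduler({"T1": "T0", "T2": "T0"})
sched.create("T0")
sched.request_create("T1")
sched.request_create("T2")
sched.create("T1")
blocked = sched.can_create("T2")      # T1 is created but has not yet returned
sched.request_commit("T1", 7)
sched.commit("T1", 7)
unblocked = sched.can_create("T2")    # T1 has returned, so T2 may now start
```

The run shows the depth-first discipline: T2 cannot be created while its sibling T1 is active, and becomes eligible once T1 has returned.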


Let S be a serial system, and let α be a sequence of operations of S. We say that α is
well-formed iff its projection at every primitive is well-formed. If α is a schedule of S, then
α is a serial schedule. In [LM], it is shown that all serial schedules are well-formed.

Let S be a serial system, and let γ be an arbitrary sequence of operations. We say that
γ is serially correct with respect to S for transaction T provided that γ|T = α|T for some
schedule α of S.


2.3  Model  Extensions  for  Replicated  Data  Systems 

In  this  section,  we  add  to  the  model  some  definitions  that  are  useful  for  formalizing  and 
understanding  replicated  data  management  algorithms. 

In  order  to  understand  why  these  particular  definitions  are  useful,  it  is  helpful  to  keep 
in  mind  the  general  proof  strategy  we  use.  As  explained  in  Chapter  1,  for  each  algorithm 
considered  we  first  construct  a  serial  system  in  which  database  items  are  implemented 
as  multiple  replicas,  where  access  to  the  replicas  is  controlled  by  the  replication  algorithm. 
Then,  we  construct  a  serial  system  (with  the  same  user  transactions)  in  which  each  database 
item is implemented as a single replica. Finally, we prove that each user transaction (see footnote 1) in the replicated system has the same execution as its corresponding transaction in the non-replicated system.

Footnote 1: For each system, we will define formally what is meant by a user transaction in terms of the system type. In general, however, one may think of user transactions as all the non-access transactions that do not model part of the replication algorithm. As a rule, user transactions are those transactions which we do not describe with fully-specified automata.

We have already discussed serial systems and provided formal definitions for transactions, accesses, and executions. However, in order to give a more precise meaning to the above description of our proof strategy, we need formal definitions for “database item,” “replica,” and “corresponding transaction.”

Logical Data Items: We refer to database items as “logical data items” to distinguish them from their physical counterparts, the replicas.

A logical data item x is a variable, whose type is the tuple (V_x, i_x). The set V_x is the domain of possible values for x, and i_x ∈ V_x is the initial value of x. We require that a special undefined value, nil, be an element of V_x, and that a special place-holder symbol, ⊥, not be an element of V_x.

Read-write  Objects:  Each  replica  is  modelled  as  a  fully  specified  basic  object  called 
a  read-write  object,  where  the  domain  and  initial  value  depend  upon  the  particular  data 
replication  management  algorithm  and  the  type  of  the  logical  data  item.  Before  we  can 
specify  read-write  object  automata,  we  require  the  following  definition. 

If d and d' are data values from a domain D ∪ {⊥}, then d ∘ d' is defined to be d if d' is ⊥, and d' otherwise. If t = (d_1, d_2, ...) and t' = (d'_1, d'_2, ...) are tuples of the same type, then we define t ∘ t' to be the tuple (d_1 ∘ d'_1, d_2 ∘ d'_2, ...). This operator allows us to overwrite certain components of a tuple while leaving the other components unchanged.
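
As a small illustration of the overwrite operator, the following Python sketch applies it to single values and, componentwise, to tuples. The names `BOTTOM`, `overwrite`, and `overwrite_tuple` are assumptions introduced here for illustration. `BOTTOM` stands in for the place-holder symbol.

```python
# BOTTOM is a hypothetical stand-in for the place-holder symbol.
BOTTOM = object()


def overwrite(d, d_new):
    """Return d if d_new is the place-holder, and d_new otherwise."""
    return d if d_new is BOTTOM else d_new


def overwrite_tuple(t, t_new):
    """Apply overwrite componentwise to two tuples of the same length."""
    if len(t) != len(t_new):
        raise ValueError("tuples must have the same number of components")
    return tuple(overwrite(d, d_new) for d, d_new in zip(t, t_new))
```

For instance, `overwrite_tuple(("a", "b"), ("c", BOTTOM))` replaces only the first component and leaves the second unchanged.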

We now define the concept of a read-write object with domain D and initial value d.

The state of a read-write object O with domain D and initial value d ∈ D consists of two components, active and data. The variable active (initially nil) holds the name of the current access to O. Data holds an element of D (initially d). Every read-write object has a set of accesses, denoted accesses(O). Each access T to a read-write object has the attributes kind(T) ∈ {read, write} and data(T) ∈ D. When kind(T) = write, data(T) is the data to be written.

Input operations:  CREATE(T), where T ∈ accesses(O)

Output operations:  REQUEST-COMMIT(T,v), where T ∈ accesses(O)



•  CREATE(T), for T ∈ accesses(O)

Postcondition:  active(s) = T

•  REQUEST-COMMIT(T,v), for kind(T) = read

Precondition:  active(s’) = T
v = data(s’)

Postcondition:  active(s) = nil

•  REQUEST-COMMIT(T,v), for kind(T) = write and data(T) = d

Precondition:  active(s’) = T
v = nil

Postcondition:  data(s) = data(s’) ∘ d
active(s) = nil

A  read-write  object  accepts  read  and  write  accesses.  For  read  accesses,  it  returns  the 
value  in  the  data  component  of  its  state.  For  write  accesses,  it  applies  the  write  value  to 
its data value using the ∘ operator. For example, if its current data value is (a, b) and it processes a write access with data (c, ⊥), the resulting data component of its state will be (c, b).

If T ∈ accesses(O), we say that O(T) = O. That is, we use O(T) to denote the read-write object to which T is an access.
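
The behavior of a read-write object can be sketched in Python as follows. This is an illustrative approximation, not the formal automaton. The class and method names are assumptions, `None` plays the role of nil, the data component is assumed to be a tuple, and `BOTTOM` is a hypothetical stand-in for the place-holder symbol.

```python
BOTTOM = object()  # hypothetical stand-in for the place-holder symbol


class ReadWriteObject:
    """Sketch of a read-write object with state components active and data."""

    def __init__(self, initial):
        self.active = None          # None plays the role of nil
        self.data = tuple(initial)  # data component, assumed to be a tuple

    def create(self, access):
        # CREATE(T): input operation; records the current access.
        self.active = access

    def request_commit(self, access, kind, write_data=None):
        # Precondition for both kinds: the access must be the active one.
        if access is None or self.active != access:
            raise RuntimeError("REQUEST-COMMIT not enabled")
        if kind == "read":
            result = self.data      # v = data(s')
        elif kind == "write":
            # data(s) = data(s') o d, applied componentwise.
            self.data = tuple(
                old if new is BOTTOM else new
                for old, new in zip(self.data, write_data)
            )
            result = None           # v = nil
        else:
            raise ValueError("kind must be 'read' or 'write'")
        self.active = None          # active(s) = nil
        return result
```

Because `active` is reset on every REQUEST-COMMIT, a second REQUEST-COMMIT for the same access is not enabled unless the access is created again.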

Lemma 2  Read-write objects are basic objects.

Proof: It suffices to show that read-write objects preserve well-formedness of schedules. Let O be a read-write object. Let α = α'π be a schedule of O, where π = REQUEST-COMMIT(T,v), and assume that α' is well-formed. We must show that: (1) CREATE(T) occurs in α', and (2) no REQUEST-COMMIT for T occurs in α'. These properties are guaranteed by the use of the variable active. A precondition for REQUEST-COMMIT for T is that active = T. Since only a CREATE(T) can cause active to equal T, part (1) holds. Assume, for contradiction, that a REQUEST-COMMIT for T occurs in α'. By part (1), the REQUEST-COMMIT for T must occur after a CREATE(T). A postcondition of REQUEST-COMMIT for T is that active = nil. Therefore, the state of O after α' has active ≠ T, because well-formedness implies that α' contains at most one CREATE(T) operation. However, if active ≠ T, then π (REQUEST-COMMIT for T) is not enabled in O after α'. But α'π is a schedule of O, giving us a contradiction. Thus, part (2) holds. ■

Extensions of Systems: We want to define formally the notion of “corresponding transactions” so that we can be precise in our comparisons of each pair of replicated and non-replicated systems. That is, for certain pairs of systems, we would like a function that maps each transaction of one system to some transaction in the other system. In order for this function to be well-defined, we must impose certain restrictions on the system types of the two systems.

Let S' and S be two systems with system types Σ' and Σ, respectively. System type Σ' is an extension of system type Σ if the transaction tree of Σ is a subgraph of the transaction tree of Σ' and both trees have the same root. If Σ' is an extension of Σ, then we say that system S' is an extension of system S.

If system S' is an extension of system S, relating the transactions in the two systems is easy. We define function T_SS' : T_S → T_S' to map transactions in S to their same-named transactions in S'. The inverse, T_S'S, is a partial function unless S and S' have the same transaction tree.

Configurations:  As  a  final  addition  to  the  model,  we  introduce  the  following  general 
definitions,  which  are  central  to  the  algorithms  we  study. 

Let S be any arbitrary set, and let Q be the power set 2^S. We define configurations(S) to be the set of all pairs of the form (r, w), where r, w ⊆ Q. (We sometimes refer to r and w as sets of read-quorums and write-quorums, respectively.) The set legal(S) is defined to be the set of all elements (r, w) of configurations(S) such that every element of r has a non-empty intersection with every element of w.

We  say  that  every  element  of  configurations(S)  is  a  configuration  of  S,  and  that  every 
element  of  legal(S)  is  a  legal  configuration  of  S. 
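
To make the definition of a legal configuration concrete, the following Python sketch checks the intersection property directly. The function names are assumptions introduced here. `majority_quorums` is only one hypothetical way to build quorum sets and is not part of the model, which allows arbitrary sets of subsets.

```python
from itertools import combinations


def is_legal(config):
    """True iff every read-quorum intersects every write-quorum."""
    r, w = config
    return all(rq & wq for rq in r for wq in w)


def majority_quorums(replicas):
    """Hypothetical helper: all subsets holding a strict majority of replicas."""
    items = sorted(replicas)
    smallest = len(items) // 2 + 1
    return {
        frozenset(c)
        for size in range(smallest, len(items) + 1)
        for c in combinations(items, size)
    }
```

With three replicas, majority quorums for both reads and writes form a legal configuration, as does the "read one, write all" choice. A configuration whose read-quorum {d1} and write-quorum {d2} are disjoint is not legal.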


Notation: We let N denote the set of non-negative integers (i.e., {0, 1, 2, ...}).


Chapter 3

Fixed  Quorum  Consensus 


In this chapter, we formalize and prove the correctness of a generalized version of Gifford’s algorithm without reconfiguration, as described in the introduction. In Section 3.1, we define system B, a replicated serial system that uses the fixed quorum consensus algorithm to manage replicas, and prove some properties of its schedules. Then, in Section 3.2, we define a corresponding non-replicated serial system, named system A. We prove the correctness of the fixed quorum consensus algorithm in Section 3.3 by showing that system B simulates system A in a strong sense. Finally, in Section 3.4, we show that non-serial replicated systems are correct.

3.1  Replicated  Serial  System 

The  replicated  serial  system  defined  in  this  section  is  an  ordinary  serial  system  in  which 
certain  logical  data  items  are  replicated.  That  is,  they  are  implemented  as  several  basic 
objects  (replicas),  rather  than  just  one.  We  impose  a  restriction  on  the  transaction  tree  so 
that  all  accesses  to  the  replicas  are  the  children  of  transaction  manager  automata  (TMs), 
which  we  define  explicitly.  The  TMs  model  the  Quorum  Consensus  algorithm  itself.  We 
model  the  read  and  write  operations  of  the  algorithm  by  providing  two  kinds  of  TMs,  read- 
TMs  and  write-TMs.  We  place  no  restrictions  on  the  remaining  automata,  except  that  they 
preserve  well-formedness.  The  system  is  formally  defined  as  follows. 

Fix I, a set of logical data items. We define system B to be a serial system of type (T, parent, O, V). For each element x of I, we define:

•  dm(x), a subset of O,

•  acc(x), a subset of the accesses in T,

•  tm_r(x) and tm_w(x), disjoint subsets of the non-accesses in T,

•  config(x), a legal configuration of dm(x).

Let tm(x) = tm_r(x) ∪ tm_w(x). We require that acc(x) is exactly the set of all accesses to objects in dm(x). In our replicated serial system, the replicas for x will be associated with the members of dm(x), and the logical accesses to x will be managed by automata associated with the members of tm(x). Since we want all accesses to replicas for x to be controlled by the replication algorithm, we require that T ∈ acc(x) iff parent(T) ∈ tm(x). Finally, for all pairs of distinct items x, y ∈ I, we require that dm(x) ∩ dm(y) = ∅.

We define the user transactions in B to be the set of non-access transactions in T that are not in tm(x) for all x ∈ I. We refer to accesses in acc(x) for all x ∈ I as replica accesses, and to the remaining accesses in T as non-replica accesses.

Figure  3.1  provides  an  example  of  a  possible  transaction  tree  for  system  B. 

In system B, each member of dm(x) has a corresponding data manager automaton (DM) for x, each member of tm_r(x) has an associated read-TM automaton for x, and each member of tm_w(x) has an associated write-TM automaton for x. From the restrictions on the system type, then, the members of acc(x) are the accesses to the DMs for x. Furthermore, the accesses to DMs for x are exactly the children of the TMs for x. DMs and TMs for x are described below.

Data  Managers:  The  set  of  data  managers  for  logical  data  item  x  models  the  set  of 
physical  replicas  of  x.  Each  DM  is  a  read-write  object  that  keeps  a  version-number  and  a 
value  for  x.  The  formal  definition  follows. 

If x is a logical data item, a DM for x is a read-write object over domain D_x = N × V_x with initial data (0, i_x). We refer to each member of D_x as a (version-number, value) pair. (For v ∈ D_x, we use the record notation v.version-number and v.value to refer to the components of v.)

Figure 3.1: A possible transaction tree for system B. Transactions are labeled as follows: U = user transaction; TM = transaction manager; a, b = non-replica accesses; the remaining labels denote replica accesses (for example, the access to replica 1 of logical data item x).

Lemma  3  DMs  are  basic  objects. 

Proof: Immediate from Lemma 2. ■

Recall  that  we  have  restricted  the  system  type  of  B  so  that  accesses  to  DMs  for  x  are 
invoked only by TMs for x. We now define read-TMs and write-TMs for x.

Read TMs: Let x be a logical data item in I. The purpose of a read-TM for x is to
perform  a  logical  read  access  to  x.  A  read-TM  for  x  invokes  read  accesses  to  multiple  DMs 
for  x.  It  then  returns  the  “current”  value  of  x,  which  it  calculates  from  the  information 
returned  by  the  read  accesses.  In  Lemma  8,  we  show  that  read-TMs  in  system  B  do,  in 
fact,  return  the  proper  value  of  x.  That  is,  a  read-TM  returns  the  value  that  would  be 
expected,  given  the  sequence  of  logical  write  accesses  to  x  that  precedes  its  invocation. 

A read-TM T for x has state components awake, data, requested, and read, where awake is a boolean value, data is a value in the domain D_x, requested is a subset of acc(x), and read is a subset of dm(x). Initially, data is (0, i_x), awake is false, and requested and read are both empty.

Note:  Whenever  an  undefined  variable  (for  example,  q  in  the  REQUEST-COMMIT 
operation  of  the  following  automaton)  appears  in  the  pre-  and/or  postconditions  for  an 
operation,  then  that  variable  has  an  implicit  existential  quantifier  (i.e.,  there  exists  a  q  such 
that...). 

Input operations:  CREATE(T)
COMMIT(T’,v), where T’ ∈ children(T) and v ∈ D_x
ABORT(T’), where T’ ∈ children(T)

Output operations:  REQUEST-CREATE(T’), where T’ ∈ children(T)
REQUEST-COMMIT(T,v), where v ∈ D_x

•  CREATE(T)

Postcondition:  awake(s) = true

•  REQUEST-CREATE(T’), where kind(T’) = read

Precondition:  awake(s’) = true
T’ ∉ requested(s’)

Postcondition:  requested(s) = requested(s’) ∪ {T’}

•  COMMIT(T’,v)

Postcondition:  read(s) = read(s’) ∪ {O(T’)}
if v.version-number > data(s’).version-number then data(s) = v

•  ABORT(T’)

Postcondition:  (no change)

•  REQUEST-COMMIT(T,v)

Precondition:  awake(s’) = true
q ∈ config(x).r
q ⊆ read(s’)
v = data(s’).value

Postcondition:  awake(s) = false

A  read-TM  collects  data  from  some  number  of  DMs  for  x,  always  keeping  the  data  from 
the  DM  with  the  highest  version  number  seen  so  far.  When  a  read-quorum  of  DMs  has 
been  seen,  the  read-TM  may  request  to  commit  and  return  its  data. 
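
The read-TM automaton can be approximated by the following Python sketch. It is an illustration under stated assumptions rather than the formal definition. The class and method names are invented here, (version-number, value) pairs are modelled as 2-tuples, and the read-quorums of config(x) are passed in as a collection of frozensets of DM names.

```python
class ReadTM:
    """Sketch of a read-TM: tracks the highest-versioned data seen so far."""

    def __init__(self, read_quorums, initial_value):
        self.read_quorums = read_quorums   # plays the role of config(x).r
        self.awake = False
        self.data = (0, initial_value)     # (version-number, value)
        self.requested = set()             # read accesses already requested
        self.read = set()                  # DMs whose reads have committed

    def create(self):
        self.awake = True

    def request_create(self, access):
        # Enabled only while awake and only once per access.
        if not self.awake or access in self.requested:
            raise RuntimeError("REQUEST-CREATE not enabled")
        self.requested.add(access)

    def commit(self, dm, v):
        # COMMIT(T', v), where dm is O(T'): keep the data with the
        # highest version-number seen so far.
        self.read.add(dm)
        if v[0] > self.data[0]:
            self.data = v

    def abort(self, access):
        pass  # no change to the state

    def commit_enabled(self):
        # Some read-quorum q must satisfy q <= read.
        return self.awake and any(q <= self.read for q in self.read_quorums)

    def request_commit(self):
        if not self.commit_enabled():
            raise RuntimeError("REQUEST-COMMIT not enabled")
        self.awake = False
        return self.data[1]                # v = data(s').value
```

Note that the sketch, like the automaton, does not choose a target quorum in advance. It simply becomes able to commit once the set of DMs heard from happens to cover some read-quorum.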

It is interesting to note the extensive use of nondeterminism in this algorithm. For example, the read-TM does not set out to access any particular read-quorum in the configuration. Rather, the read-TM simply invokes any number of accesses to any of the DMs until it happens to notice that COMMIT operations have been received from some read-quorum of DMs. Also, since it is not necessary for correctness (as opposed to efficiency) for the read-TM to remember which of its children have aborted, the ABORT(T’) operation has no postconditions.

The  nondeterminism  allows  for  greater  generality  of  our  results.  However,  one  would 
not  want  to  implement  read-TMs  this  loosely  in  a  real  system.  For  the  sake  of  efficiency,  one 
would  want  to  limit  the  number  of  accesses  invoked  by  a  read-TM.  For  example,  one  would 
want  the  read-TM  to  invoke  accesses  with  some  particular  read-quorum  in  mind.  Also,  one 
would  not  want  the  read-TM  to  invoke  an  access  to  a  DM  from  which  it  has  already  received 
information.  Similarly,  one  might  not  want  the  read-TM  to  invoke  an  access  to  a  DM  if 
several  previous  accesses  to  that  DM  have  aborted.  The  important  point,  however,  is  that 
all  of  our  results  apply  even  if  such  heuristics  are  added.  Our  proofs  depend  only  upon  the 
fact  that  all  operations  performed  satisfy  the  preconditions  and  postconditions  we  define. 

Write TMs: Let x be a logical data item in I. The purpose of a write-TM for x is to perform a logical write access to x. The formal description of a write-TM automaton follows.

A write-TM T for x has state components awake, data, read-requested, write-requested, read and written, where awake is a boolean variable, data is an element of D_x, read-requested and write-requested are subsets of acc(x), and read and written are subsets of dm(x). Initially, data = (0, i_x), awake is false, and the sets are empty. Every write-TM T for x has an associated value value(T) ∈ V_x, the value to be written to the logical data item.

Input operations:  CREATE(T)
COMMIT(T’,v), where T’ ∈ children(T) and v ∈ D_x
ABORT(T’), where T’ ∈ children(T)

Output operations:  REQUEST-CREATE(T’), where T’ ∈ children(T)
REQUEST-COMMIT(T,v), where v = nil

•  CREATE(T)

Postcondition:  awake(s) = true

•  REQUEST-CREATE(T’), where kind(T’) = read

Precondition:  awake(s’) = true
T’ ∉ read-requested(s’)

Postcondition:  read-requested(s) = read-requested(s’) ∪ {T’}

•  COMMIT(T’,v), where kind(T’) = read

Postcondition:  if write-requested(s’) = {} then
    read(s) = read(s’) ∪ {O(T’)}
    if v.version-number > data(s’).version-number then
        data(s).version-number = v.version-number

•  REQUEST-CREATE(T’), where kind(T’) = write and data(T’) = d

Precondition:  awake(s’) = true
q ∈ config(x).r
q ⊆ read(s’)
d = (data(s’).version-number + 1, value(T))
T’ ∉ write-requested(s’)

Postcondition:  write-requested(s) = write-requested(s’) ∪ {T’}

•  COMMIT(T’,v), where kind(T’) = write

Postcondition:  written(s) = written(s’) ∪ {O(T’)}

•  ABORT(T’)

Postcondition:  (no change)

•  REQUEST-COMMIT(T,v)

Precondition:  awake(s’) = true
v = nil
q ∈ config(x).w
q ⊆ written(s’)

Postcondition:  awake(s) = false

A write-TM invokes read accesses to some number of DMs for x, keeping track of the highest version number returned. Once information from a read-quorum of DMs has been collected, the write-TM may begin invoking write accesses. (See the REQUEST-CREATE(T’) operation.) The version-number of each write access invoked is one greater than the version-number in the data component of the write-TM’s state, and the value of each write access invoked is value(T). Once COMMIT operations have been received from a write-quorum of DMs, the write-TM may request to commit.

It is possible that some read accesses to the DMs may not commit until after the write-TM has already invoked one or more write accesses. Thus, some read accesses may actually return the data that was written to the DMs on behalf of the write-TM itself. Therefore, in order to prevent the write-TM from seeing the data it wrote and incorrectly increasing its version-number, the COMMIT operation for read accesses is defined so that the state of the write-TM is modified only if no write accesses have been invoked.
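
The two phases of a write-TM, and the guard that ignores late read results, can be sketched in Python as follows. This is an approximation under stated assumptions. The class and method names are invented here, only the version-number part of the data component is tracked (the write-TM never uses the value part), and the configuration is passed as a pair of collections of frozensets of DM names.

```python
class WriteTM:
    """Sketch of a write-TM: a read phase followed by a write phase."""

    def __init__(self, config, value):
        self.read_quorums, self.write_quorums = config
        self.value = value             # value(T), the value to be written
        self.awake = False
        self.version = 0               # data.version-number
        self.read_requested = set()
        self.write_requested = set()
        self.read = set()
        self.written = set()

    def create(self):
        self.awake = True

    def request_create_read(self, access):
        if not self.awake or access in self.read_requested:
            raise RuntimeError("REQUEST-CREATE (read) not enabled")
        self.read_requested.add(access)

    def commit_read(self, dm, v):
        # The state changes only if no write access has been requested,
        # so the TM never counts data that it wrote itself.
        if not self.write_requested:
            self.read.add(dm)
            if v[0] > self.version:
                self.version = v[0]

    def request_create_write(self, access):
        have_quorum = any(q <= self.read for q in self.read_quorums)
        if not (self.awake and have_quorum
                and access not in self.write_requested):
            raise RuntimeError("REQUEST-CREATE (write) not enabled")
        self.write_requested.add(access)
        return (self.version + 1, self.value)   # data(T') for the access

    def commit_write(self, dm):
        self.written.add(dm)

    def request_commit(self):
        have_quorum = any(q <= self.written for q in self.write_quorums)
        if not (self.awake and have_quorum):
            raise RuntimeError("REQUEST-COMMIT not enabled")
        self.awake = False
        return None                              # v = nil
```

In this sketch, a read result that arrives after the first write access has been requested leaves `version` untouched, so every write access issued by the same write-TM carries the same version-number.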

Our  discussion  of  the  nondeterminism  in  read-TMs  also  applies  to  write-TMs,  as  well 
as  to  all  other  automata  we  define. 

Lemma  4  TMs  are  transactions. 

Proof: Let T be one of these automata. It suffices to show that T preserves well-formedness. Let α = α'π be a schedule of T where π is an output operation, and assume that α' is well-formed. We need to show that (1) CREATE(T) occurs in α', (2) no REQUEST-COMMIT operation for T occurs in α', and (3) if π is a REQUEST-CREATE(T’) operation, then no REQUEST-CREATE(T’) occurs in α'. By the definition of T, no output operation can be issued if awake = false, and only the CREATE operation can set awake to true. Therefore, part (1) is true. Only the REQUEST-COMMIT operation can set awake to false, and by definition of well-formed schedule, α' can contain at most one CREATE operation. (Once awake becomes false, it remains false forever.) Therefore, part (2) holds. Whenever a REQUEST-CREATE(T’) operation is performed, that fact is remembered permanently in the state of T. A precondition for REQUEST-CREATE(T’) is that T’ has not previously been created by T. Since T = parent(T’), only T may issue a REQUEST-CREATE(T’) operation. Therefore, part (3) holds. ■

Lemma  5  Schedules  of  system  B  are  well-formed. 

Proof: By Lemmas 3 and 4, DMs are basic objects and TMs are transactions. Therefore, system B is a serial system. In [LM], it is proved that all schedules of serial systems are well-formed. ■

The following definitions are useful for describing the logical accesses to the logical data items in system B and for setting up inductive arguments about these logical accesses.

Access sequence: This definition formalizes the intuitive notion of a sequence of logical accesses to x.

Let β be a sequence of operations of system B, and let x be a logical data item in I. Then the access sequence of x in β, denoted access(x, β), is defined to be the subsequence of β containing the CREATE and REQUEST-COMMIT operations for the members of tm(x).

Logical state: The following definition formalizes the intuitive notion of the “current state” of a logical data item, the expected return value of a logical read.

Let β be a sequence of operations of system B, and let x be a logical data item in I. The logical state of x after β, denoted logical-state(x, β), is defined to be either value(T) if REQUEST-COMMIT(T,v) is the last REQUEST-COMMIT operation for a write-TM in access(x, β), or i_x if no REQUEST-COMMIT operation for a write-TM occurs in access(x, β).
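
The access sequence and the logical state can be computed mechanically from a sequence of operations. The Python sketch below assumes a hypothetical encoding that is not part of the model: each operation is a tuple whose first entry is the operation name and whose second entry is the transaction name, `tms` is the set tm(x), and `write_values` maps each write-TM in tm(x) to its value(T).

```python
def access_sequence(beta, tms):
    """Subsequence of beta with the CREATE and REQUEST-COMMIT operations of TMs for x."""
    return [
        op for op in beta
        if op[0] in ("CREATE", "REQUEST-COMMIT") and op[1] in tms
    ]


def logical_state(beta, tms, write_values, initial):
    """value(T) of the last write-TM to request commit, or the initial value."""
    state = initial
    for op in access_sequence(beta, tms):
        if op[0] == "REQUEST-COMMIT" and op[1] in write_values:
            state = write_values[op[1]]
    return state
```

Operations of replica accesses and of user transactions are filtered out, so only the logical reads and writes of x remain in the access sequence.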

Current version number: Let β be a sequence of operations of system B, and let x be a logical data item in I. Let last(x, β) denote the subset of acc(x) such that for each member T of last(x, β), REQUEST-COMMIT for T is the last REQUEST-COMMIT operation for a write access to O(T) in β. The current version number of x after β, denoted current-vn(x, β), is defined as follows. If last(x, β) is non-empty, then current-vn(x, β) is the maximum over all T ∈ last(x, β) of data(T).version-number. Otherwise, current-vn(x, β) = 0.
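
The definition of the current version number can be traced with a short Python sketch. The encoding is a hypothetical one chosen for illustration: operations are tuples (name, transaction, ...), and `write_accesses` maps each write access T in acc(x) to the pair (O(T), data(T)), where data(T) is a (version-number, value) tuple.

```python
def current_vn(beta, write_accesses):
    """Maximum version-number over the last committed write access to each DM."""
    last = {}  # DM name -> last write access to that DM with a REQUEST-COMMIT
    for op in beta:
        if op[0] == "REQUEST-COMMIT" and op[1] in write_accesses:
            dm, _ = write_accesses[op[1]]
            last[dm] = op[1]
    if not last:
        return 0
    return max(write_accesses[t][1][0] for t in last.values())
```

Only the most recent write to each DM contributes, which mirrors the fact that a later write access overwrites the data component left by an earlier one.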

Lemma 6  If β is a schedule of B and x is a logical data item in I, then access(x, β) begins with a CREATE operation for some TM in tm(x) and continues alternately with REQUEST-COMMIT and CREATE operations for TMs in tm(x) such that each REQUEST-COMMIT for T is preceded immediately by a CREATE(T) operation.

Proof: By definition, access(x, β) contains only CREATE and REQUEST-COMMIT operations for TMs in tm(x). By Lemma 5, β is a well-formed schedule, so each REQUEST-COMMIT for T must be preceded by a CREATE(T) operation. Finally, since β is a serial schedule, all operations for a given transaction must be contiguous. ■

Lemma 7  Let x be a logical data item, and let β be a schedule of B. Then the following property holds after β: The highest version number among the states of all DMs in dm(x) is current-vn(x, β).

Proof: Since DMs are read-write objects, the only operation that can change the version-number in the state of a DM O for x is a REQUEST-COMMIT for T operation, where O(T) = O and T is a write access. More specifically, the version-number in the state of a DM O after β is data(T).version-number, where REQUEST-COMMIT for T is the last such REQUEST-COMMIT in β. In the definition of current-vn(x, β), the set last(x, β) contains the last write access for each DM in dm(x) that has a REQUEST-COMMIT for a write access in β. Therefore, the maximum over all T ∈ last(x, β) of data(T).version-number is the highest version number among the states of all DMs in dm(x) after β. This maximum is exactly the definition of current-vn(x, β). ■

The following lemma is the key to the proof of Theorem 10. Condition 1 is only needed for carrying through the inductive argument. The important part of the lemma is Condition 2, which tells us that each read-TM returns the value expected as dictated by the previous logical write operations. That is, each read-TM returns the logical-state of the data item. Because the system is serial, we are able to carry out a simple inductive proof using standard assertional techniques. In the proof of this lemma, as well as the proofs of the remaining lemmas and theorems, we formally consider all details except the preconditions and postconditions for the operations of the basic objects; because their behavior is so simple, these operations receive only informal treatment.

Lemma 8  Let x be a logical data item in I. Let β be a schedule of B such that access(x, β) is of even length.

1. The following properties hold after β:

(a) There exists a write-quorum q ∈ config(x).w such that for all DMs O ∈ q, if d is the data component of O, then d.version-number = current-vn(x, β).

(b) For all DMs O ∈ dm(x), if d is the data component of O, then d.version-number = current-vn(x, β) implies that d.value = logical-state(x, β).

2. If β ends in REQUEST-COMMIT(T,v) with T ∈ tm_r(x), then v = logical-state(x, β).

Proof: By induction on the length of β.

Base case: Let β be the empty schedule. By definition, current-vn(x, β) = 0 and logical-state(x, β) = i_x. Initially, all DMs in dm(x) have version-number = 0 and value = i_x by the definition of a DM. Therefore the states after β of all the DMs in every q ∈ config(x).w have version-number = current-vn(x, β) and value = logical-state(x, β). Thus, part 1 holds. Since β is empty, it does not end in a REQUEST-COMMIT operation of a read-TM for x. So, part 2 holds vacuously.

Induction: Let β = β'τ, where access(x, τ) begins with the last CREATE operation in access(x, β). Assume that the Lemma holds for β'. By Lemma 6 and the fact that access(x, β) is of even length, access(x, τ) = (CREATE(T_f), REQUEST-COMMIT(T_f, v_f)) for some T_f ∈ tm(x) and v_f ∈ V_x. We note the following facts about T_f:

Fact 1: All accesses in τ to DMs in dm(x) are descendants of T_f.

Proof: Since β is a serial schedule, T_f is the only TM in tm(x) whose descendants have operations in τ. Furthermore, the system type of B is constrained so that all accesses to DMs in dm(x) are children of TMs in tm(x).

Fact 2: Let s be the state of T_f after any prefix of β. If read(s) is non-empty, then data(s).version-number and data(s).value contain the highest version-number and associated value among the states of the DMs in read(s) after β'.

Proof: By definition, a DM O is added to the read component of T_f only as the result of a COMMIT(T’,v’) operation, where parent(T’) = T_f, T’ is a read access, O(T’) = O, and T_f has not invoked any prior REQUEST-CREATEs for write accesses (see footnote 2). Since β is well-formed, all such COMMIT(T’,v’) operations must occur in τ. By Fact 1, all accesses to O that take place in τ are children of T_f. Since T_f invokes no write accesses prior to the COMMIT(T’,v’) operation, the data components of the DMs in dm(x) for the COMMIT(T’,v’) operation are the same as after β'. Therefore, v’ is the data component of the state of O after β'. By definition, T_f retains the maximum version-number (and its associated value) among all the return values of COMMIT for T’ operations that result in O(T’) being added to the read component.

Fact 3: Let s be the state of T_f after any prefix of β. If read(s) contains some read-quorum r ∈ config(x).r, then data(s).version-number = current-vn(x, β') and data(s).value = logical-state(x, β').

Proof: By the induction hypothesis, there exists some write-quorum w ∈ config(x).w such that the states of all DMs in w after β' have version-number = current-vn(x, β'), and every DM with version-number = current-vn(x, β') has value = logical-state(x, β'). By Lemma 7, current-vn(x, β') is the highest version number among all DMs in dm(x). Since config(x) is a legal configuration of dm(x), r and w must have a non-empty intersection. So, read(s) must contain at least one DM in w. Therefore, by Fact 2, data(s).version-number = current-vn(x, β') and data(s).value = logical-state(x, β').

Footnote 2: This last condition is trivially true when T_f is a read-TM.

From Fact 1, we know that all accesses to DMs for x in τ are children of T_f. Therefore, in order to prove that the induction hypothesis holds for β, we merely need to demonstrate that T_f preserves the properties stated. There are two possibilities for T_f:



•  If T_f is a read-TM, then logical-state(x, β) = logical-state(x, β') by definition. Also, since T_f invokes only read accesses, the version-number and value components of the states of the DMs in dm(x) after β are the same as after β', and current-vn(x, β) = current-vn(x, β'). Therefore, part 1 of the Lemma holds for β.

Let s_f be the state of T_f when T_f issues its REQUEST-COMMIT operation. The preconditions for REQUEST-COMMIT require that read(s_f) contain some read-quorum r ∈ config(x).r. Therefore, by Fact 3, data(s_f).value = logical-state(x, β'), which equals logical-state(x, β). By definition, v_f = data(s_f).value, so part 2 of the Lemma holds for β.

• If T_f is a write-TM, then logical-state(x, β) = value(T_f) by definition. We note the
following fact about T_f:

Fact 4: For all write accesses T' invoked by T_f, data(T') = (current-vn(x, β')+1,
value(T_f)).

Proof: Let s' be the state of T_f when it issues REQUEST-CREATE(T').
By definition, data(T') = (data(s').version-number+1, value(T_f)). The precondition
for the REQUEST-CREATE(T') operation requires that read(s') contain
some read-quorum q ∈ config(x).r. Therefore, by Fact 3, data(s').version-number
= current-vn(x, β').

Let s_f be the state of T_f when T_f issues its REQUEST-COMMIT operation. The
preconditions for REQUEST-COMMIT require that written(s_f) contain some
write-quorum w ∈ config(x).w. Furthermore, no DM is added to the written component
of the state of T_f unless a write access to that DM has committed to T_f. So, τ
must contain a REQUEST-COMMIT operation for a write access to each DM in w.
After a COMMIT of a write access T' to a DM, the data component of that DM is
equal to data(T'). Therefore, by Fact 4, the states after β of all the DMs in w must
have value = value(T_f) and version-number = current-vn(x, β')+1. (By Fact 1, T_f
is the only transaction that issues write accesses to DMs in dm(x) in τ.) By Lemma
7, current-vn(x, β') is the highest version-number among the states of DMs in dm(x)
after β'. Since every write access in τ to DMs in dm(x) has version-number =
current-vn(x, β')+1, we know that this is the highest version-number among DMs in dm(x)
after β. That is, current-vn(x, β')+1 = current-vn(x, β). Therefore, since value(T_f)
= logical-state(x, β), part 1 of the Lemma holds. Since T_f is not a read-TM, β does
not end with a REQUEST-COMMIT of a read-TM for x, so part 2 holds vacuously.

Thus,  the  Lemma  holds  in  both  cases.  ■ 
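
The invariant established by Lemma 8 can be illustrated with a small executable sketch. The Python below is not part of the formal model: the names DM, Config, logical_read, and logical_write are hypothetical, quorums are chosen explicitly by the caller rather than nondeterministically, and the configuration is assumed to be legal (every read-quorum intersects every write-quorum).

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class DM:
    """A replica holding a (version-number, value) pair."""
    version: int = 0
    value: object = None


@dataclass(frozen=True)
class Config:
    """A configuration: a set of read-quorums and a set of write-quorums."""
    r: frozenset  # frozenset of frozensets of DM names
    w: frozenset

    def is_legal(self) -> bool:
        # Legal iff every read-quorum meets every write-quorum.
        return all(rq & wq for rq, wq in product(self.r, self.w))


def logical_read(dms, read_quorum):
    """Return (version, value) of the highest-versioned DM in the quorum."""
    best = max((dms[name] for name in read_quorum), key=lambda d: d.version)
    return best.version, best.value


def logical_write(dms, read_quorum, write_quorum, value):
    """Read a quorum to learn the current version, then write version + 1."""
    version, _ = logical_read(dms, read_quorum)
    for name in write_quorum:
        dms[name] = DM(version + 1, value)


dms = {name: DM(0, "init") for name in "abc"}
majorities = frozenset(frozenset(q) for q in ("ab", "ac", "bc"))
config = Config(r=majorities, w=majorities)

# Two logical writes that use different quorums each time.
logical_write(dms, frozenset("ab"), frozenset("bc"), "v1")
logical_write(dms, frozenset("ac"), frozenset("ab"), "v2")
```

After the two writes, replica c still holds the older pair (1, "v1"), yet every read-quorum contains at least one replica with the current version-number 2, so every logical read returns "v2". This is the situation described by part 1 of the Lemma.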

3.2  Non-replicated  Serial  System 

As the basis of our correctness condition, we define non-replicated serial system A of type
(T_A, parent_A, O_A, V_A) in terms of replicated serial system B of type (T_B, parent_B, O_B, V_B).³

System  A  is  identical  to  System  B,  except  that  logical  accesses  to  objects  in  I  (which  are 
implemented  as  TMs  in  system  B)  are  implemented  as  accesses  in  system  A,  and  the  logical 
data  items  in  I  (which  are  implemented  as  collections  of  DMs  in  system  B )  are  implemented 
as  single  read-write  objects  in  system  A.  These  changes  are  reflected  in  the  system  type, 
which  is  formally  defined  as  follows: 

• T_A = T_B − ⋃_{x∈I} acc(x)

• parent_A = parent_B restricted to T_A

• O_A = (O_B − ⋃_{x∈I} dm(x)) ∪ {tm(x) | x ∈ I}

• V_A = V_B

Informally, to construct the type of system A from that of system B, we first remove
from T all the accesses to the DMs for objects in I. As a result, all the TMs for objects in
I become leaves in T and are therefore accesses. Next, we remove from O all the DMs for
objects in I. Also, we partition all the accesses that were formerly TMs according to their
logical data item. Each class of this partition is a new object in O. Thus, each logical data
item is implemented by a single object.
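
As an illustration only, the construction of the type of A from the type of B can be written as a few set operations. In the sketch below the function name derive_type_A and the dictionary encoding of parent, dm, acc, and tm are assumptions made for the example; V is omitted because V_A = V_B.

```python
def derive_type_A(T_B, parent_B, O_B, dm, acc, tm):
    """Build (T_A, parent_A, O_A) from the type of system B.

    dm, acc, and tm map each logical data item x in I to a set of
    object names, access names, and TM names, respectively.
    """
    replica_accesses = set().union(*acc.values())
    replica_objects = set().union(*dm.values())

    # Remove the accesses to the DMs; the TMs become leaves.
    T_A = T_B - replica_accesses
    parent_A = {t: p for t, p in parent_B.items() if t in T_A}
    # Remove the DMs; each class tm(x) becomes a single new object.
    O_A = (O_B - replica_objects) | {frozenset(tms) for tms in tm.values()}
    return T_A, parent_A, O_A


# A tiny tree: root T0, user transaction U, one TM for x with two
# replica accesses, and one non-replica access a to object Oa.
T_B = {"T0", "U", "TM1", "x1", "x2", "a"}
parent_B = {"U": "T0", "TM1": "U", "a": "U", "x1": "TM1", "x2": "TM1"}
O_B = {"Oa", "X1", "X2"}

T_A, parent_A, O_A = derive_type_A(
    T_B, parent_B, O_B,
    dm={"x": {"X1", "X2"}},
    acc={"x": {"x1", "x2"}},
    tm={"x": {"TM1"}},
)
```

In the result, TM1 has no children and is therefore an access, and the two DM objects X1 and X2 are replaced by the single object tm(x).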

3 We introduce the subscripts to distinguish the components of A from the components of B.
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Figure  3.2:  The  transaction  tree  for  system  A  that  corresponds  to  the  transaction  tree  for 
B  shown  in  Figure  3.1.  Transactions  are  labeled  as  follows: 

U  =  user  transaction;  a,  b,  x,  y  =  accesses. 

Figure 3.2 illustrates the transaction tree for system A that corresponds to the
transaction tree for system B given in Figure 3.1.


We would like to relate transactions in system B to those in system A. Recall that the
function f_BA is well-defined, provided that system B is an extension of system A. Thus,
we prove the following lemma.

Lemma  9  System  B  is  an  extension  of  system  A. 

Proof: We need to show that T_A ⊆ T_B and that T_A has the same root as T_B. Since
T_A = T_B − (⋃_{x∈I} acc(x)), we know that T_A ⊆ T_B. Furthermore, T_A and T_B must have the
same root, unless the root of T_B is in acc(x) for some x ∈ I. However, every member of
acc(x) is a child of some member of tm(x), so no access in acc(x) for any x ∈ I could be
the root of T_B. ■

We define user transactions in system A to be all non-access transactions in T_A. We
note that T is a user transaction in system B iff f_BA(T) is a user transaction in system A.
This is because if T is a TM in system B, then f_BA(T) is an access transaction.

Transactions and objects in system A have the same corresponding automata as in
system B, except that for all x ∈ I, the following hold:

1. The object corresponding to tm(x) is modelled as a read-write object O over domain
V_x with initial value i_x. (We refer to this particular read-write object as O(x).)

2. For each transaction T ∈ tm(x), f_BA(T) is an access to O(x) such that

(a) if T is a read-TM, then f_BA(T) is a read access, and

(b) if T is a write-TM, then f_BA(T) is a write access with data(f_BA(T)) = value(T).

3.3  Correctness 

In  this  section,  we  prove  that  system  B  is  correct  by  showing  that  user  transactions  cannot 
distinguish  between  replicated  serial  system  B  and  non-replicated  serial  system  A. 

Theorem 10 Let β be a schedule of replicated serial system B. There exists a schedule α
of non-replicated serial system A such that the following two conditions hold.

1. For all objects O in system B that are not in dm(x) for any x, α|O = β|O.

2. For all user transactions T in system B, α|f_BA(T) = β|T.

Proof: We construct α by removing from β all the REQUEST-CREATE(T),
CREATE(T), REQUEST-COMMIT(T,v), COMMIT(T,v), and ABORT(T) operations for all
transactions T in acc(x) for all x ∈ I. Clearly, the two conditions hold. What needs to be
proved is that α is a schedule of A. We proceed by induction on the length of β.

Base case: Suppose β is empty. Then α is also empty and is therefore a schedule of A.

Induction: Let β = β'π_β, where the claim holds for β'. Let α = α'π_α, where α' is the
schedule of A corresponding to β'. There are five cases for π_β.

1. Invocations, Operations, and Returns of Replica Accesses: If π_β is a
REQUEST-CREATE(T), CREATE(T), REQUEST-COMMIT(T,v), COMMIT(T,v), or ABORT(T)
operation where T is in acc(x) and x ∈ I, then by the construction π_α is empty.
Therefore, α is the same as α', which is a schedule of A.

2. REQUEST-COMMITs for Non-Replica Accesses: If π_β is a REQUEST-COMMIT
for a non-replica access T, then by the construction π_α = π_β. By Part 1 of the
induction hypothesis, α'|O = β'|O, where O is the object accessed by T. Since O is
modelled by the same automaton in both systems A and B, the states of O after α' and
after β' are the same. Furthermore, f_BA(T) is modelled by the same automaton as T.
Therefore, since the preconditions for π_β are satisfied in the state of T after β', they
must also be satisfied in the state of f_BA(T) after α'. Therefore α|f_BA(T) is a schedule
of f_BA(T). So, by the Composition Lemma (Lemma 1), α is a schedule of A.

3. Output Operations of User Transactions: If π_β is an output operation of some
user transaction T, then by the construction π_α = π_β. By Part 2 of the induction
hypothesis, α'|f_BA(T) = β'|T. Furthermore, f_BA(T) is modelled by the same
automaton as T. Therefore, since the preconditions for π_β are satisfied in the state of
T after β', they must also be satisfied in the state of f_BA(T) after α'. Therefore,
α|f_BA(T) is a schedule of f_BA(T). So, by the Composition Lemma, α is a schedule of
A.

4. Output Operations of TMs (except those already covered by Case 1): If π_β is a
REQUEST-COMMIT(T,v), where T ∈ tm(x) for some x ∈ I, then by the construction
π_α = π_β. By the definition of system A, f_BA(T) is an access to a read-write
object. The only precondition for a REQUEST-COMMIT of T, then, is that T has
been created. By the construction and the fact that β is a well-formed schedule,
CREATE(T) occurs in α'. Therefore, the precondition for REQUEST-COMMIT(T,v') is
satisfied in A for some v'.

If T is a write-TM, then v = v' = nil. We need to show that v = v' if T is
a read-TM. By Lemma 8, we know that v = logical-state(x, β'). By definition of a
read-write object, v' is the value in the state of O(x) after α'. We observe that, by
the construction, α'|O(x) = access(x, β'). So, by definition of system A, the last write
access in α' to O(x) has the same value as the last write-TM in access(x, β'). Hence, the
value in the state of O(x) after α' is logical-state(x, β'). Therefore, v = v'. (If there is no
write-TM in access(x, β'), then there are no write accesses to O(x) in α'. In this case,
the value in the state of O(x) after α' is i_x, which is logical-state(x, β').)

5. Output Operations of the Scheduler (except those already covered by Case 1):
If π_β is a CREATE(T), a COMMIT for T, or an ABORT(T), where T is a user
transaction, T is a non-replica access, or T ∈ tm(x) for some x ∈ I, then by the
construction π_α = π_β.

If π_β is a CREATE(T) or ABORT(T), then the preconditions for π_α are (1) there
must be a REQUEST-CREATE(T) in α' but no CREATE(T) or ABORT(T) in α',
and (2) all siblings of f_BA(T) with creates in α' must have returned (committed
or aborted) in α'. Since β is a well-formed schedule, REQUEST-CREATE(T) is in
β', and, by the construction, is in α' as well. Similarly, since no CREATE(T) or
ABORT(T) can occur in β', none can occur in α' either. Therefore, precondition (1)
is satisfied. By the construction, all commits and aborts in α' of siblings of f_BA(T)
must also appear in β'. So, since β is well-formed, precondition (2) must also be
satisfied.

If π_β is a COMMIT(T,v), then the preconditions for π_α are (1) a
REQUEST-COMMIT(T,v) must occur in α', (2) f_BA(T) cannot have a COMMIT or ABORT
in α', and (3) any children invoked by f_BA(T) must have returned in α'. Using the
same argument as above, by the construction and the fact that β is well-formed,
preconditions (1) and (2) must be satisfied. If T is a non-replica access or T ∈ tm(x)
for some x ∈ I, then f_BA(T) cannot have any children in A. If T is a user transaction,
then all return operations of the children of T in β' are, by the construction, included
in α'. Therefore, since β is well-formed, precondition (3) must be satisfied.


In all cases, α is a schedule of A.
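
The construction of α used in the proof is a simple projection, which the following sketch makes concrete. The encoding of an operation as a (kind, transaction, value) tuple and the name project_to_A are assumptions made for this example; they are not part of the formal model.

```python
def project_to_A(beta, replica_accesses):
    """Drop every operation whose transaction is a replica access."""
    return [op for op in beta if op[1] not in replica_accesses]


# A schedule fragment of B: a read-TM TM1 reads one replica through x1.
beta = [
    ("REQUEST-CREATE", "TM1", None),
    ("CREATE", "TM1", None),
    ("REQUEST-CREATE", "x1", None),
    ("CREATE", "x1", None),
    ("REQUEST-COMMIT", "x1", (0, "init")),
    ("COMMIT", "x1", (0, "init")),
    ("REQUEST-COMMIT", "TM1", "init"),
    ("COMMIT", "TM1", "init"),
]

alpha = project_to_A(beta, replica_accesses={"x1"})
```

What remains for TM1 is the sequence REQUEST-CREATE, CREATE, REQUEST-COMMIT, COMMIT, which has the shape of a single access to the read-write object O(x) in system A. The sketch shows only the construction; that α is actually a schedule of A is what the induction establishes.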
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3.4  Concurrent  Replicated  Systems 

So  far,  we  have  been  able  to  deal  exclusively  with  serial  systems  in  order  to  simplify  our 
reasoning.  We  now  complete  the  correctness  proof  by  showing  that  non-serial  replicated 
systems are correct. Recall the definition of serial correctness: Let S be a serial system, and
let γ be an arbitrary sequence of operations. We say that γ is serially correct with respect
to S for transaction T provided that γ|T = σ|T for some schedule σ of S.

With  the  following  theorem,  we  show  that  given  a  correct  concurrency  control  algorithm, 
combining  that  algorithm  with  our  replication  algorithm  yields  a  correct  system.  This 
theorem  allows  us  to  achieve  a  complete  separation  of  the  issues  of  concurrency  control  and 
recovery  from  the  issues  of  replication.  In  other  words,  one  may  prove  a  concurrency  control 
algorithm  correct,  then  separately  prove  a  replication  algorithm  correct  for  serial  systems, 
and  finally  apply  this  theorem  to  show  that  the  (combined)  concurrent  replicated  system 
is  correct.  The  modularity  of  this  proof  method  permits  us  to  ignore  all  the  complicated 
interactions  of  the  two  algorithms  that  one  would  need  to  consider  in  a  direct  proof  that 
the  concurrent  replicated  system  simulates  a  non-replicated  serial  system. 

Theorem 11 Let C be any system that has the same type as system B, and let the set of
user transactions in C be the same as in B. Assume that all schedules γ of C are serially
correct with respect to serial system B for all non-orphan⁴ non-access transactions. Then
all schedules γ of C are serially correct with respect to system A for all non-orphan user
transactions.

Proof:  An  immediate  consequence  of  Theorem  10.  ■ 

So,  any  concurrency  control  algorithm  that  provides  serializability  at  the  level  of  the 
copies  may  be  combined  with  the  Fixed  Quorum  Consensus  replica  management  algorithm 
to  produce  a  correct  system.  Interesting  concurrency  control  algorithms  that  satisfy  this 
condition  include  Reed’s  multi-version  timestamp  concurrency  control  algorithm  [R]  and 
Moss' two phase locking algorithm with separate read and write locks [Mo]. (See also the
correctness proof given by Fekete et al. [FLMW].)

4 A transaction T is an orphan in γ if ABORT(T') occurs in γ for some ancestor T' of T.


Chapter 4

Reconfigurable Quorum Consensus


We  now  extend  the  results  of  Chapter  3  to  systems  that  permit  reconfiguration.  That  is, 
we  permit  the  read-  and  write-quorums  to  change  dynamically,  rather  than  fixing  them  for 
the  entire  execution.  This  flexibility  is  important  for  coping  with  site  and  link  failures  in 
practical  systems.  For  example,  if  some  DMs  are  down,  we  may  want  to  change  the  quorums 
so  that  logical  accesses  can  be  processed  in  spite  of  the  failures. 

We redefine systems A and B and present proofs analogous to those for the fixed configuration
systems. In doing so, some interesting new considerations arise: As before, the
logical  accesses  are  described  in  terms  of  read-  and  write-TMs.  However,  we  also  need  a 
new  kind  of  TM,  called  a  reconfigure-TM,  to  effect  changes  in  the  quorums.  We  would 
like  the  reconfigure-TMs  to  be  modelled  as  transactions  for  the  sake  of  uniformity,  and  to 
be  positioned  in  the  tree  as  children  of  the  user  transactions  in  order  to  model  the  correct 
atomicity  requirements.  For  instance,  if  T  and  T’  are  TMs  for  x  that  are  invoked  by  the 
same  user  transaction,  we  would  like  to  permit  reconfiguration  of  x  to  take  place  between 
the COMMIT of T and the CREATE of T'. However, the reconfigure-TMs are special in
that their invocations and returns are not to be controlled, or even seen, by the user transactions.
Rather, they are intended to run spontaneously and transparently from the user's
point  of  view.  So,  we  want  the  reconfigure-TMs  to  be  positioned  in  the  tree  as  children  of 
the  user  transactions,  but  we  do  not  want  the  user  programs  to  be  aware  of  their  invocations 
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and  returns. 

This  conflict  introduces  a  modelling  problem.  We  solve  the  problem  by  associating  a  spy 
automaton  with  each  user  transaction.  The  spy  wakes  up  with  the  associated  transaction 
and  nondeterministically  invokes  reconfigure-TMs  until  the  associated  transaction  requests 
to  commit.  In  this  way,  we  capture  formally  the  notions  of  spontaneity  and  transparency 
while  at  the  same  time  modelling  the  proper  atomicity  requirements. 

Gifford's reconfiguration algorithm works as follows.¹ In addition to a value and a
version number, each replica of x contains a configuration and a generation number. The
value and version number are initialized as in the non-reconfiguration case, and all replicas
of x initially hold the same configuration and generation number.

To perform a logical read of x, a TM reads DMs for x, keeping in its state the value v
and version number t from the DM with the highest version number seen, the configuration
c and generation number g from the DM with the highest generation number seen, and the
set d of the names of the DMs read. If the TM reaches a state in which c has a read-quorum
that is a subset of d, then the TM returns v.

To perform a logical write of x with new value v', a TM again reads DMs for x, keeping
in its state the version number t from the DM with the highest version number seen, the
configuration c and generation number g from the DM with the highest generation number
seen, and the set d of the names of the DMs read. If the TM reaches a state in which c
has a read-quorum that is a subset of d, then the TM computes the new version number
t' = t + 1 and writes v' along with t' to some write-quorum of DMs in c.

To reconfigure x with new configuration c', a TM first reads DMs for x and computes
v, t, c, g, and d, just as for a logical read. If the TM reaches a state in which c has a
read-quorum that is a subset of d, then the TM does the following. It writes v and t to a
write-quorum in c', and it writes c' and g' = g + 1 to a write-quorum in c.²
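
The three operations just described can be sketched in executable form. This is a minimal illustration under stated assumptions, not the formal algorithm: the names Config, Replica, quorums, collect, logical_read, logical_write, and reconfigure are hypothetical, the order in which DMs are read and the write-quorums used are supplied by the caller, and failures and concurrency are ignored.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    r: frozenset  # read-quorums, each a frozenset of DM names
    w: frozenset  # write-quorums, each a frozenset of DM names


@dataclass(frozen=True)
class Replica:
    version: int
    value: object
    generation: int
    config: Config


def quorums(*groups):
    return frozenset(frozenset(g) for g in groups)


def collect(dms, order):
    """Read DMs in the given order until the configuration c with the
    highest generation seen has a read-quorum inside the set d read."""
    v = t = c = g = None
    d = set()
    for name in order:
        rep = dms[name]
        d.add(name)
        if t is None or rep.version > t:
            t, v = rep.version, rep.value
        if g is None or rep.generation > g:
            g, c = rep.generation, rep.config
        if any(q <= d for q in c.r):
            return v, t, c, g
    raise RuntimeError("no read-quorum of the newest configuration seen")


def logical_read(dms, order):
    return collect(dms, order)[0]


def logical_write(dms, order, write_quorum, new_value):
    _, t, c, _ = collect(dms, order)
    assert frozenset(write_quorum) in c.w
    for name in write_quorum:
        dms[name] = replace(dms[name], version=t + 1, value=new_value)


def reconfigure(dms, order, new_config, new_wq, old_wq):
    v, t, c, g = collect(dms, order)
    assert frozenset(new_wq) in new_config.w and frozenset(old_wq) in c.w
    for name in new_wq:   # copy v and t to a write-quorum of c'
        dms[name] = replace(dms[name], version=t, value=v)
    for name in old_wq:   # record c' and g + 1 at a write-quorum of c
        dms[name] = replace(dms[name], generation=g + 1, config=new_config)


majority = Config(r=quorums("ab", "ac", "bc"), w=quorums("ab", "ac", "bc"))
read_one = Config(r=quorums("a", "b", "c"), w=quorums("abc"))
dms = {n: Replica(0, "init", 0, majority) for n in "abc"}

logical_write(dms, "ab", "ab", "v1")
reconfigure(dms, "bc", read_one, new_wq="abc", old_wq="ab")
```

After the reconfiguration, replica c still stores the old configuration with generation 0. A reader that starts at c cannot finish under the old majority configuration without reading a second replica, and either of the other two replicas reveals the new configuration, so logical_read(dms, "ca") returns "v1".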

We generalize Gifford's reconfiguration algorithm in the same ways that we generalized
the fixed quorum consensus algorithm in the previous chapter. A formal description of the
generalized algorithm follows. Because of the additional complication involved in
reconfiguration and in order to avoid needless repetition of code, we separate the read, write, and
reconfigure tasks of the TMs into modules called coordinators. This is done most naturally
by introducing another level of nesting, providing additional evidence of the power of
nesting as a modelling tool.

1 In [Gi], Gifford describes the algorithm in terms of votes. However, we substitute the more general
configuration definition.

2 The description in [Gi] actually requires that the new configuration be written to both an old and a new
write-quorum. However, we find that it is only necessary to write this information to an old write-quorum.

The  formalisms  and  proofs  of  this  section  follow  the  same  pattern  as  those  of  the  previous 
section. 

4.1  Reconfigurable  Replicated  Serial  System 

Like  the  fixed  configuration  system,  the  replicated  serial  system  defined  in  this  section  is 
an  ordinary  serial  system  in  which  certain  logical  data  items  are  replicated.  We  impose 
a  restriction  on  the  transaction  tree  that  all  accesses  to  the  replicas  are  the  children  of 
coordinator  automata,  which  are,  in  turn,  children  of  TMs.  Together,  the  coordinators  and 
TMs  model  the  Quorum  Consensus  algorithm  itself.  We  model  the  logical  operations  of  the 
algorithm by providing three kinds of TMs: read-TMs, write-TMs, and reconfigure-TMs.
The  system  type  is  formally  defined  as  follows. 

Fix I, a set of logical data items. We define system B to be a serial system of type
(T, parent, O, V). For each element x of I, we define:

• dm(x), a subset of O,

• acc(x), a subset of the accesses in T,

• co(x), a subset of the non-accesses in T,

• tm_r(x), tm_w(x), and tm_rec(x), disjoint subsets of the non-accesses in T,

• config(x), a legal configuration of x.

Let tm(x) = tm_r(x) ∪ tm_w(x) ∪ tm_rec(x). We require that acc(x) is exactly the set of all
accesses to objects in dm(x). Also, we require that T ∈ acc(x) iff parent(T) ∈ co(x), and that
T ∈ co(x) iff parent(T) ∈ tm(x). That is, the accesses to DMs for x are exactly the children
of the coordinators for x, which are, in turn, exactly the children of the TMs for x. Finally,
for all pairs x, y ∈ I with x ≠ y, we require that dm(x) ∩ dm(y) = ∅. As a notational
convenience, we sometimes drop the "(x)" in dm(x), acc(x), etc. to denote the union of these
sets over all x ∈ I. For example, tm_rec is the union over all x ∈ I of tm_rec(x).
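
The structural requirements on the transaction tree can be checked mechanically. The sketch below is illustrative only: the function name well_structured and the dictionary encoding (parent maps each non-root transaction to its parent; acc, co, tm, and dm map each logical data item to a set of names) are assumptions made for the example.

```python
from itertools import combinations


def well_structured(parent, acc, co, tm, dm):
    """Check the tree constraints for every logical data item x."""
    for x in dm:
        for t, p in parent.items():
            # T in acc(x) iff parent(T) in co(x)
            if (t in acc[x]) != (p in co[x]):
                return False
            # T in co(x) iff parent(T) in tm(x)
            if (t in co[x]) != (p in tm[x]):
                return False
    # The DM sets of distinct logical data items are disjoint.
    return all(not (dm[x] & dm[y]) for x, y in combinations(dm, 2))


parent = {"U": "T0", "TM": "U", "C": "TM", "x1": "C", "x2": "C", "a": "U"}
sets = dict(
    acc={"x": {"x1", "x2"}},
    co={"x": {"C"}},
    tm={"x": {"TM"}},
    dm={"x": {"X1", "X2"}},
)

ok = well_structured(parent, **sets)

# A user transaction that invokes a replica access directly is rejected.
bad_parent = dict(parent, x1="U")
bad = well_structured(bad_parent, **sets)
```

Here ok is True, while bad is False because x1 is in acc(x) but its parent U is not a coordinator for x.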

The  set  of  user  transactions  in  system  B  consists  of  all  non-access  transactions  that  are 
neither  in  co  nor  tm.  We  refer  to  accesses  in  acc  as  replica  accesses,  and  to  the  remaining 
accesses  in  T  as  non-replica  accesses. 

Figure  4.1  provides  an  example  of  a  possible  transaction  tree  for  system  B. 

Each member of dm(x) has an associated DM automaton for x. Each member of co(x)
has an associated read-coordinator, write-coordinator, or reconfigure-coordinator automaton
for x. The members of tm_r(x), tm_w(x), and tm_rec(x) have associated read-TM,
write-TM, and reconfigure-TM automata for x, respectively.

Each user transaction T has an associated automaton that is the composition of a user
automaton and a spy automaton T_spy. The user automaton may be any arbitrary automaton
that satisfies the definition of a transaction, and does not have REQUEST-CREATE(T'),
COMMIT for T', or ABORT(T') operations defined for any reconfigure-TM T'. The set
spies refers to the collection of all spy automata in system B.

To  avoid  confusion,  the  reader  should  note  that  “user  transaction”  refers  to  the  name  in 
T,  whereas  “user  transaction  automaton”  refers  to  the  automaton  itself  (the  composition 
of  the  user  automaton  and  the  spy  automaton). 

To  access  x,  a  transaction  invokes  some  read-  or  write-TM  in  tm(x).  This  TM  invokes 
one  or  more  coordinators,  each  of  which  invokes  read  or  write  accesses  to  multiple  DMs.  The 
definitions  constrain  T  so  that  all  accesses  to  x  must  proceed  in  this  fashion;  no  high-level 
transaction,  for  example,  can  directly  invoke  a  coordinator  for  x  or  an  access  to  a  DM  for 
x.  DMs,  coordinators,  TMs,  and  spy  automata  are  described  in  the  next  four  subsections. 

4.1.1  Data  Managers 

As before, the set of data managers for logical data item x models the set of physical
replicas of x. Each DM is a read-write object that keeps a version-number, a value, a
generation-number, and a configuration for x. The formal definition follows.

Figure 4.1: A possible transaction tree for system B. Transactions are labeled as follows:
U = user transaction; TM = transaction manager; C = coordinator; a, b = non-replica
accesses; x_1 = access to replica 1 of logical data item x, etc.

If x is a logical data item in I, a DM for x in system B is a read-write object over
domain D_x = N × V_x × N × legal(dm(x)) and with initial data (0, i_x, 0, config(x)). We
refer to each member of D_x as a (version-number, value, generation-number, configuration)
quadruple.

Lemma  12  DMs  are  basic  objects. 

Proof:  Immediate  from  Lemma  2.  ■ 
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4.1.2  Coordinators 

We now define the coordinators, which form the intermediate level of nesting between the
replica accesses and the TMs. There are three types of coordinators: read, write, and
reconfigure. Following the three coordinator definitions, we will define the TMs.

Read Coordinators: Let x be a logical data item in I. The purpose of a read-coordinator
is to calculate the "current" version-number, value, generation-number, and configuration
of x on the basis of the data returned by the read accesses it invokes.

Read-coordinator T has state components awake, data, requested, and read, where
awake is a boolean variable, data is in the domain D_x, requested is a subset of acc(x), and
read is a subset of dm(x). Initially, awake is false, data is (0, ⊥, 0, config(x)), and requested
and read are both empty.

Input operations: CREATE(T)
                  COMMIT(T',v), where T' ∈ children(T) and v ∈ D_x
                  ABORT(T'), where T' ∈ children(T)

Output operations: REQUEST-CREATE(T'), where T' ∈ children(T)
                   REQUEST-COMMIT(T,v), where v ∈ D_x

• CREATE(T)
  Postcondition: awake(s) = true

• REQUEST-CREATE(T'), where kind(T') = read
  Precondition: awake(s') = true
                T' ∉ requested(s')
  Postcondition: requested(s) = requested(s') ∪ {T'}

• COMMIT(T',v)
  Postcondition: read(s) = read(s') ∪ {O(T')}
                 if (v.version-number > data(s').version-number) then
                   data(s).version-number = v.version-number
                   data(s).value = v.value
                 if (v.generation-number > data(s').generation-number) then
                   data(s).generation-number = v.generation-number
                   data(s).configuration = v.configuration

• ABORT(T')
  Postcondition: (no change)

• REQUEST-COMMIT(T,v)
  Precondition: awake(s') = true
                q ∈ data(s').configuration.r
                q ⊆ read(s')
                v = data(s')
  Postcondition: awake(s) = false

A read-coordinator collects data from DMs for x, and keeps track of the configuration
from the DM with the highest generation number and the value from the DM with the
highest version number seen so far. Whenever the read-coordinator reaches a state in which some
read-quorum in the current configuration (i.e., some member of data(s').configuration.r) is
a subset of the DMs it has seen (i.e., read(s')), then the read-coordinator may request to
commit and return its data. The reader should compare the code above with the code for
read-TMs in Chapter 3.
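
As an informal companion to the automaton definition, the read-coordinator's state and transitions can be mimicked by a small class. This is a sketch under assumptions: the class and method names are hypothetical, preconditions are expressed as assertions, ⊥ is represented by None, and the scheduler that would deliver CREATE, COMMIT, and ABORT is replaced by direct method calls.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Config:
    r: frozenset  # read-quorums, each a frozenset of DM names
    w: frozenset  # write-quorums


@dataclass(frozen=True)
class Quad:
    version: int
    value: object
    generation: int
    config: Config


@dataclass
class ReadCoordinator:
    data: Quad
    awake: bool = False
    requested: set = field(default_factory=set)
    read: set = field(default_factory=set)

    def create(self):
        self.awake = True

    def request_create(self, access):
        assert self.awake and access not in self.requested
        self.requested.add(access)

    def commit(self, dm_name, v):
        """COMMIT(T', v): a read access to DM dm_name returned quadruple v."""
        self.read.add(dm_name)
        d = self.data
        if v.version > d.version:
            d = Quad(v.version, v.value, d.generation, d.config)
        if v.generation > d.generation:
            d = Quad(d.version, d.value, v.generation, v.config)
        self.data = d

    def can_request_commit(self):
        return self.awake and any(q <= self.read for q in self.data.config.r)

    def request_commit(self):
        assert self.can_request_commit()
        self.awake = False
        return self.data


old = Config(r=frozenset({frozenset("ab")}), w=frozenset({frozenset("ab")}))
new = Config(r=frozenset({frozenset("b")}), w=frozenset({frozenset("ab")}))

rc = ReadCoordinator(data=Quad(0, None, 0, old))
rc.create()
rc.request_create("read-of-b")
rc.commit("b", Quad(3, "v3", 1, new))
result = rc.request_commit()
```

In this run the coordinator starts out believing the old configuration, whose only read-quorum is {a, b}. The single reply from DM b carries a higher generation number, so the coordinator adopts the new configuration, under which {b} is already a read-quorum, and it may commit without reading a.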

Write Coordinators: Let x be a logical data item in I. The purpose of a
write-coordinator is to write a given value to a write-quorum of DMs for x in a given configuration
of dm(x).

A write-coordinator T has state components awake, requested, and written, where awake
is a boolean variable, requested is a subset of acc(x), and written is a subset of dm(x).
Initially, awake is false and the sets are empty. Every write-coordinator T for x has an
associated value value(T) ∈ V_x, an associated version-number version-number(T) ∈ N, and
an associated configuration configuration(T) ∈ legal(dm(x)).

Input operations: CREATE(T)
                  COMMIT(T',v), where T' ∈ children(T)
                  ABORT(T'), where T' ∈ children(T)

Output operations: REQUEST-CREATE(T'), where T' ∈ children(T)
                   REQUEST-COMMIT(T,v), where v = nil

• CREATE(T)
  Postcondition: awake(s) = true

• REQUEST-CREATE(T'), where kind(T') = write and data(T') = d
  Precondition: awake(s') = true
                d = (version-number(T), value(T), ⊥, ⊥)
                T' ∉ requested(s')
  Postcondition: requested(s) = requested(s') ∪ {T'}

• COMMIT(T',v)
  Postcondition: written(s) = written(s') ∪ {O(T')}

• ABORT(T')
  Postcondition: (no change)

• REQUEST-COMMIT(T,v)
  Precondition: awake(s') = true
                v = nil
                q ∈ configuration(T).w
                q ⊆ written(s')
  Postcondition: awake(s) = false

When created, a write-coordinator begins invoking write accesses to DMs for x,
overwriting the version-numbers and values at the DMs with its version-number and value, but
leaving  the  generation-numbers  and  configurations  at  the  DMs  unchanged.  After  writing  to 
a  write-quorum  of  DMs  according  to  its  configuration,  the  write-coordinator  may  request 
to  commit. 
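
The effect of a write-coordinator on the replicas can be sketched as follows. Two points in this sketch are assumptions rather than statements from the text: a write whose component is ⊥ (represented here by None) is taken to leave the corresponding DM component unchanged, as the prose above suggests, and accesses are identified by the name of the DM they write. All names are hypothetical.

```python
BOTTOM = None  # stands for the undefined component ⊥


def apply_write(state, d):
    """Overwrite each component of a DM state unless the written one is ⊥."""
    return tuple(old if new is BOTTOM else new for old, new in zip(state, d))


class WriteCoordinator:
    def __init__(self, version, value, write_quorums):
        self.version, self.value = version, value
        self.write_quorums = write_quorums  # plays the role of configuration(T).w
        self.awake = False
        self.requested = set()
        self.written = set()

    def create(self):
        self.awake = True

    def request_create(self, dm_name):
        """Return the data d carried by a new write access to dm_name."""
        assert self.awake and dm_name not in self.requested
        self.requested.add(dm_name)
        return (self.version, self.value, BOTTOM, BOTTOM)

    def commit(self, dm_name):
        self.written.add(dm_name)

    def can_request_commit(self):
        return self.awake and any(q <= self.written for q in self.write_quorums)


# DM states are (version-number, value, generation-number, configuration).
dms = {n: (4, "old", 2, "cfg") for n in "abc"}
wc = WriteCoordinator(5, "new", {frozenset("ab"), frozenset("bc"), frozenset("ac")})
wc.create()

for n in "ab":
    d = wc.request_create(n)
    dms[n] = apply_write(dms[n], d)
    wc.commit(n)
```

After writing to a and b, the coordinator has covered the write-quorum {a, b} and may request to commit. The generation-number and configuration at a and b are untouched, and c still holds the old version.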


Reconfigure Coordinators: Let x be a logical data item in I. The purpose of a
reconfigure-coordinator is to write a given new configuration for x along with a given
generation number to a write-quorum of DMs in a given old configuration for x.

A reconfigure-coordinator T has state components awake, requested, and written, where
awake is a boolean variable, requested is a subset of acc(x), and written is a subset of dm(x).
Initially, awake is false and the sets are empty. Every reconfigure-coordinator T for x has
associated configurations new-configuration(T), old-configuration(T) ∈ legal(dm(x)), and
an associated generation-number generation-number(T) ∈ N.


Input operations: CREATE(T)
                  COMMIT(T',v), where T' ∈ children(T)
                  ABORT(T'), where T' ∈ children(T)

Output operations: REQUEST-CREATE(T'), where T' ∈ children(T)
                   REQUEST-COMMIT(T,v), where v = nil

• CREATE(T)
  Postcondition: awake(s) = true

• REQUEST-CREATE(T'), where kind(T') = write and data(T') = d
  Precondition: awake(s') = true
                d = (⊥, ⊥, generation-number(T), new-configuration(T))
                T' ∉ requested(s')
  Postcondition: requested(s) = requested(s') ∪ {T'}

• COMMIT(T',v)
  Postcondition: written(s) = written(s') ∪ {O(T')}

• ABORT(T')
  Postcondition: (no change)

• REQUEST-COMMIT(T,v)
  Precondition: awake(s') = true
                v = nil
                q ∈ old-configuration(T).w
                q ⊆ written(s')
  Postcondition: awake(s) = false

When created, a reconfigure-coordinator begins invoking write accesses to the DMs for x,
writing its generation-number and new-configuration to the DMs, but leaving the version
numbers and values unchanged. When an old write-quorum of DMs has been written,
according to its old-configuration, the reconfigure-coordinator may request to commit. This
is an optimization over Gifford’s algorithm, which requires that the new configuration be
written to a new write-quorum as well as to an old write-quorum.
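The commit rule above can be sketched in Python. This is only an illustrative model, not the automaton itself: the class names, the use of None for the undefined marker, and the `majority` helper are assumptions introduced here, and quorums are modeled as frozensets of DM names.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class Configuration:
    r: frozenset  # read-quorums, each a frozenset of DM names
    w: frozenset  # write-quorums, each a frozenset of DM names


@dataclass
class ReconfigureCoordinator:
    old_configuration: Configuration
    new_configuration: Configuration
    generation_number: int
    awake: bool = False
    written: set = field(default_factory=set)  # DMs whose write access committed

    def create(self) -> None:
        # CREATE(T): the coordinator wakes up.
        self.awake = True

    def access_data(self) -> tuple:
        # Data carried by each write access; version number and value are
        # None here, standing in for the undefined marker.
        return (None, None, self.generation_number, self.new_configuration)

    def commit(self, dm: str) -> None:
        # COMMIT(T', v): remember the DM written by the committed access.
        self.written.add(dm)

    def may_request_commit(self) -> bool:
        # Enabled once some write-quorum of the OLD configuration is covered.
        return self.awake and any(q <= self.written
                                  for q in self.old_configuration.w)

    def request_commit(self) -> None:
        if not self.may_request_commit():
            raise RuntimeError("no old write-quorum has been written yet")
        self.awake = False


def majority(dms) -> Configuration:
    # Hypothetical helper: every 2-element subset is a read- and write-quorum.
    quorums = frozenset(frozenset(c) for c in combinations(dms, 2))
    return Configuration(r=quorums, w=quorums)
```

With three DMs and the `majority` configuration as the old configuration, the coordinator becomes able to commit after any two of its write accesses commit, regardless of what the new configuration's quorums are.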

Lemma  13  Coordinators  are  transactions. 

Proof:  The  proof  is  identical  to  that  of  Lemma  4.  ■ 




4.1.3  Transaction  Managers 

We  now  define  the  three  kinds  of  TMs:  read,  write,  and  reconfigure.  Read-  and  write-TMs 
are  invoked  by  user  automata  in  order  to  perform  logical  reads  and  writes  to  logical  data 
items.  Reconfigure-TMs  are  invoked  by  spy  automata,  which  are  defined  following  the  TM 
definitions. 


Read TMs: Let x be a logical data item in I. The purpose of a read-TM is to perform
a logical read access to x on behalf of a user transaction.

Read-TM T has state components awake, data, requested, and read, where awake and
read are boolean variables, data is in the domain D_x, and requested is a subset of co(x).
Initially, the booleans are false, data is undefined, and requested is empty.

Input operations: CREATE(T)
COMMIT(T’,v), where T’ ∈ children(T) and v ∈ D_x
ABORT(T’), where T’ ∈ children(T)

Output operations: REQUEST-CREATE(T’), where T’ ∈ children(T)
REQUEST-COMMIT(T,v), where v ∈ D_x

•  CREATE(T) 

Postcondition:  awake(s)  =  true 


• REQUEST-CREATE(T’), where T’ is a read-coordinator

Precondition: awake(s’) = true
read(s’) = false
T’ ∉ requested(s’)

Postcondition: requested(s) = requested(s’) ∪ {T’}

• COMMIT(T’,v)

Postcondition:  if  read(s’)  =  false  then 
data(s)  =  v 
read(s)  =  true 

• ABORT(T’)

Postcondition: (no change)

• REQUEST-COMMIT(T,v)

Precondition: awake(s’) = true
read(s’) = true
v = data(s’).value

Postcondition: awake(s) = false

A read-TM invokes any number of read-coordinators. After one or more of these
coordinators commit, the read-TM may commit, returning the value component of the data
returned by the first committing read-coordinator.


Write TMs: Let x be a logical data item in I. The purpose of a write-TM for x is to
perform a logical write access to x on behalf of a user transaction.

Write-TM T has state components awake, data, requested, read, and written, where
awake, read, and written are boolean variables, data is in the domain D_x, and requested
is a subset of co(x). Initially, the booleans are false, data is undefined, and requested is
empty. Every write-TM T has an associated value value(T).

Input operations: CREATE(T)
COMMIT(T’,v), where T’ ∈ children(T) and v ∈ D_x
ABORT(T’), where T’ ∈ children(T)

Output operations: REQUEST-CREATE(T’), where T’ ∈ children(T)
REQUEST-COMMIT(T,v), where v = nil

•  CREATE(T) 

Postcondition:  awake(s)  =  true 

• REQUEST-CREATE(T’), where T’ is a read-coordinator

Precondition: awake(s’) = true
T’ ∉ requested(s’)

Postcondition: requested(s) = requested(s’) ∪ {T’}

•  COMMIT(T’,v),  where  T’  is  a  read-coordinator 

Postcondition:  if  read(s’)  =  false  then 
data(s)  =  v 
read(s)  =  true 



• REQUEST-CREATE(T’), where T’ is a write-coordinator

Precondition: awake(s’) = true
read(s’) = true
value(T’) = value(T)
version-number(T’) = data(s’).version-number + 1
configuration(T’) = data(s’).configuration
T’ ∉ requested(s’)

Postcondition: requested(s) = requested(s’) ∪ {T’}

• COMMIT(T’,v), where T’ is a write-coordinator

Postcondition: written(s) = true

•  ABORT(T’) 

Postcondition:  (no  change) 

•  REQUEST-COMMIT(T,v) 

Precondition:  awake(s’)  =  true 

written(s’)  =  true 
v  =  nil 

Postcondition:  awake(s)  =  false 

A write-TM invokes some number of read-coordinators; when the first read-coordinator
commits, the write-TM remembers the data returned. The write-TM then has the option
of invoking any number of write-coordinators, using the configuration and version-number
(incremented by one) it remembered from the first committing read-coordinator, along
with its own data value value(T). In order for the write-TM to commit, at least one of the
write-coordinators must have committed.
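As a rough illustration of the write-TM's behavior, the following Python sketch tracks the read and written flags and derives the parameters of a write-coordinator. The class, method names, and dictionary keys are assumptions made for this example; only the rules (remember the first read result, increment the version number, reuse the configuration read) come from the definition above.

```python
from typing import NamedTuple, Optional


class Data(NamedTuple):
    version_number: int
    value: object
    generation_number: int
    configuration: object


class WriteTM:
    def __init__(self, value):
        self.value = value                  # value(T)
        self.awake = False
        self.read = False
        self.written = False
        self.data: Optional[Data] = None    # undefined until a read commits

    def create(self) -> None:
        self.awake = True

    def commit_read_coordinator(self, v: Data) -> None:
        # Only the first committing read-coordinator is remembered.
        if not self.read:
            self.data = v
            self.read = True

    def write_coordinator_params(self) -> dict:
        # Mirrors the precondition of REQUEST-CREATE for a write-coordinator.
        if not (self.awake and self.read):
            raise RuntimeError("no read-coordinator has committed yet")
        return {
            "value": self.value,
            "version_number": self.data.version_number + 1,
            "configuration": self.data.configuration,
        }

    def commit_write_coordinator(self) -> None:
        self.written = True

    def may_request_commit(self) -> bool:
        return self.awake and self.written
```

Later read-coordinator commits are ignored, so every write-coordinator created by one write-TM carries the same version number and configuration.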


Reconfigure TMs: Let x be a logical data item in I. The purpose of a reconfigure-TM is
to change the “current” configuration of x to a given target configuration and to propagate
the current value and version number as necessary.

Reconfigure-TM T has state components awake, data, requested, read, and written,
where awake, read, and written are boolean variables, data is in the domain D_x, and
requested is a subset of co(x). Initially, the booleans are false, data is undefined, and requested
is empty. Every reconfigure-TM T has an associated configuration target-configuration(T)
∈ legal(dm(x)).

Input operations: CREATE(T)
COMMIT(T’,v), where T’ ∈ children(T) and v ∈ D_x
ABORT(T’), where T’ ∈ children(T)

Output operations: REQUEST-CREATE(T’), where T’ ∈ children(T)
REQUEST-COMMIT(T,v), where v = nil

•  CREATE(T) 

Postcondition: awake(s) = true

• REQUEST-CREATE(T’), where T’ is a read-coordinator

Precondition: awake(s’) = true
T’ ∉ requested(s’)

Postcondition: requested(s) = requested(s’) ∪ {T’}

• COMMIT(T’,v), where T’ is a read-coordinator

Postcondition: if read(s’) = false then
data(s) = v
read(s) = true

• REQUEST-CREATE(T’), where T’ is a write-coordinator

Precondition: awake(s’) = true
read(s’) = true
value(T’) = data(s’).value
version-number(T’) = data(s’).version-number
configuration(T’) = target-configuration(T)
T’ ∉ requested(s’)

Postcondition: requested(s) = requested(s’) ∪ {T’}

• COMMIT(T’,v), where T’ is a write-coordinator

Postcondition: written(s) = true


• REQUEST-CREATE(T’), where T’ is a reconfigure-coordinator

Precondition: awake(s’) = true
read(s’) = true
old-configuration(T’) = data(s’).configuration
new-configuration(T’) = target-configuration(T)
generation-number(T’) = data(s’).generation-number + 1
T’ ∉ requested(s’)

Postcondition: requested(s) = requested(s’) ∪ {T’}

•  COMMIT(T’,v),  where  T’  is  a  reconfigure-coordinator 

Postcondition:  (no  change) 

•  ABORT(T’) 

Postcondition:  (no  change) 

•  REQUEST-COMMIT(T,v) 

Precondition:  awake(s’)  =  true 

written(s’)  =  true 
v  =  nil 

Postcondition: awake(s) = false

A reconfigure-TM invokes some number of read-coordinators; when the first
read-coordinator commits, the reconfigure-TM remembers the data returned. Then, the
reconfigure-TM may invoke any number of write-coordinators with its target-configuration, and with the
value and version number from the data returned by the first committing read-coordinator.
Also, the TM may invoke any number of reconfigure-coordinators, where the old configuration
and old generation number are those from the first committing read-coordinator and
the new configuration is the TM’s target-configuration. In order for the reconfigure-TM to
commit, at least one write-coordinator must have committed.3
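The parameters a reconfigure-TM hands to its two kinds of coordinators follow directly from the preconditions above and can be sketched as two small Python functions. The function names and dictionary keys are illustrative assumptions; `data` stands for the data returned by the first committing read-coordinator.

```python
from typing import NamedTuple


class Data(NamedTuple):
    version_number: int
    value: object
    generation_number: int
    configuration: object


def write_coordinator_params(data: Data, target_configuration) -> dict:
    # The current value and version number are propagated unchanged;
    # only the configuration used for the write differs.
    return {
        "value": data.value,
        "version_number": data.version_number,
        "configuration": target_configuration,
    }


def reconfigure_coordinator_params(data: Data, target_configuration) -> dict:
    # The generation number read is incremented by one; the configuration
    # read becomes the old configuration of the coordinator.
    return {
        "old_configuration": data.configuration,
        "new_configuration": target_configuration,
        "generation_number": data.generation_number + 1,
    }
```

The contrast with a write-TM is visible in the first function: a write-TM increments the version number and keeps the configuration, whereas a reconfigure-TM keeps the version number and changes the configuration.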

Lemma  14  TMs  are  transactions. 

Proof:  The  proof  is  identical  to  that  of  Lemma  4.  ■ 

Recall  that  reconfigure-TMs  are  invoked  only  by  spy  automata,  which  are  composed 
with  user  automata  to  form  user  transaction  automata.  (See  Lemma  15.)  Spy  automata 
are  defined  as  follows. 

3 There is no need to wait for a reconfigure-coordinator to commit: should all the reconfigure-coordinators
abort, the reconfigure-TM would have merely propagated the current value.

4.1.4 Spy Automata

The spy automaton T_spy associated with user transaction T has state components awake
and create-requested, where awake is a boolean and create-requested is a subset of the
reconfigure-TMs. Initially, awake is false and create-requested is empty.

Input operations: CREATE(T)
COMMIT(T’,v), where T’ ∈ children(T)
ABORT(T’), where T’ ∈ children(T)
REQUEST-COMMIT(T,v)

Output operations: REQUEST-CREATE(T’), where T’ ∈ children(T)

•  CREATE(T) 

Postcondition:  awake(s)  =  true 

• REQUEST-CREATE(T’), where T’ is a reconfigure-TM

Precondition: awake(s’) = true
T’ ∉ create-requested(s’)

Postcondition: create-requested(s) = create-requested(s’) ∪ {T’}

• COMMIT(T’,v)

Postcondition:  (no  change) 

•  ABORT(T’) 

Postcondition:  (no  change) 

•  REQUEST-COMMIT(T,v) 

Postcondition:  awake(s)  =  false 

Note  that  a  spy  is  not  a  transaction  automaton,  but  rather  one  piece  of  a  transaction. 
It  wakes  up  when  its  associated  transaction  T  is  created  and  goes  to  sleep  when  T  requests 
to  commit.  That  is,  the  spy  automaton  does  not  request  to  commit;  instead,  it  receives 
REQUEST-COMMIT(T,v)  as  an  input  operation. 

While a spy automaton is awake, it may invoke any number of reconfigure-TMs. In
this way, the model formalizes the spontaneous invocation of reconfigure-TMs. The user
automaton associated with a given spy has no control over the configurations of the logical
data items. In fact, the user automaton cannot directly observe changes in configurations
because it has no input operations that could reveal this information.

In our definition of the spy automaton, the choice of when to change
configuration and which new configuration to use is completely general (i.e.,
nondeterministic). However, one might wish to add a heuristic for making judicious choices about when
and how to intervene. This may involve adding new input operations and state components
to the spy that would allow it to record the CREATE, COMMIT, and ABORT patterns of
accesses to the logical data items. The nondeterminism of the spy automaton allows such
heuristics to be added without compromising the validity of our results.

Lemma 15 Let user automaton T be a transaction that does not have any
REQUEST-CREATE(T’) operations defined where T’ is a reconfigure-TM. Then the composition, also
named T, of user automaton T with spy automaton T_spy is also a transaction.

Proof: It suffices to show that the composition preserves well-formedness. Let α = α′π
be a schedule of the composition where π is an output operation of the composition, and
assume that α′ is well-formed. We need to show that: (1) CREATE(T) occurs in α′, (2)
no REQUEST-COMMIT operation occurs in α′, and (3) if π is a REQUEST-CREATE(T’)
operation, then no REQUEST-CREATE(T’) occurs in α′.

Since user automaton T is a transaction, we know that it cannot issue an output operation
unless CREATE(T) occurs in α′. By definition, T_spy can issue no output operation unless
its awake flag is true, and only a CREATE(T) operation can set awake to true. Therefore,
neither user automaton T nor T_spy can issue an output operation unless CREATE(T) occurs
in α′. So, part (1) holds for the composition.

Similarly, for part (2), we know that if π is an output operation of user automaton T then
no REQUEST-COMMIT for T occurs in α′, because the user automaton is a transaction
and T_spy does not issue REQUEST-COMMIT operations for T. T_spy can issue no output
operation if the awake flag is false, and only a REQUEST-COMMIT operation for T can
cause the awake flag to become false. Since α′ is well-formed, it can contain at most one
CREATE(T) operation. So, once awake becomes false, it is false forever. Therefore, neither
user automaton T nor T_spy can issue an output operation if REQUEST-COMMIT for T
occurs in α′. Thus, part (2) holds for the composition.

Finally, we note that user automaton T does not invoke reconfigure-TMs and T_spy invokes
only reconfigure-TMs. So, it is sufficient to show that part (3) holds for user automaton
T and T_spy independently. We know that part (3) holds for the user automaton, because
it is a transaction. Whenever T_spy issues a REQUEST-CREATE(T’), it puts T’ into its
create-requested list, and nothing is ever removed from the create-requested list. Since a
precondition for REQUEST-CREATE(T’) is that T’ is not in the create-requested list, part
(3) must hold for T_spy. Therefore, part (3) holds for the composition. ■

4.1.5  Properties 

In this subsection, we prove several properties of system B. Most of the subsection
is devoted to the proof of Lemma 20, which is central to our correctness argument.

Lemma  16  Schedules  of  system  B  are  well-formed. 

Proof: By Lemmas 12, 13, and 14, DMs are basic objects, and coordinators and TMs are
transactions. Furthermore, by Lemma 15, all the remaining members of T are transactions.
Therefore, system B is a serial system. By [LM], all schedules of serial systems are
well-formed. ■

We  now  present  definitions  to  describe  the  logical  accesses  to  the  logical  data  items  in 
system  B.  These  definitions  are  analogous  to  those  for  the  fixed  configuration  system. 

Access sequence: Let β be a sequence of operations of system B, and let x be a logical
data item in I. Then the access sequence of x in β, denoted access(x,β), is defined to be
the subsequence of β containing the CREATE and REQUEST-COMMIT operations for the
members of tm(x).

Logical state: Let β be a sequence of operations of system B, and let x be a logical
data item in I. The logical state of x after β, denoted logical-state(x,β), is defined to be
either value(T) if REQUEST-COMMIT(T,v) is the last REQUEST-COMMIT operation
for a write-TM in access(x,β), or i_x if no REQUEST-COMMIT operation for a write-TM
occurs in access(x,β).

Current version number: Let β be a sequence of operations of system B, and let
x be a logical data item in I. Let last(x,β) denote the subset of acc(x) such that for
each member T of last(x,β), REQUEST-COMMIT for T is the last REQUEST-COMMIT
operation for a write access to O(T) in β with data(T).version-number ≠ ⊥.4 The current
version number of x after β, denoted current-vn(x,β), is defined as follows. If last(x,β) is
non-empty, then current-vn(x,β) is the maximum over all T ∈ last(x,β) of
data(T).version-number. Otherwise, current-vn(x,β) = 0.

Current generation number: Let β be a sequence of operations of system B, and let x
be a logical data item in I. Let last(x,β) denote the subset of acc(x) such that for each
member T of last(x,β), REQUEST-COMMIT for T is the last REQUEST-COMMIT operation
for a write access to O(T) in β with data(T).generation-number ≠ ⊥.5 The current generation
number of x after β, denoted current-gn(x,β), is defined as follows. If last(x,β) is non-empty,
then current-gn(x,β) is the maximum over all T ∈ last(x,β) of data(T).generation-number.
Otherwise, current-gn(x,β) = 0.
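The two definitions can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions: a schedule is reduced to the list of REQUEST-COMMIT operations of write accesses, each recorded as a hypothetical `WriteCommit` tuple, and None stands for the undefined marker ⊥.

```python
from typing import NamedTuple, Optional


class WriteCommit(NamedTuple):
    dm: str                            # the DM O(T) written by access T
    version_number: Optional[int]      # None stands for the undefined marker
    generation_number: Optional[int]


def _current(schedule, component: str) -> int:
    # last[dm] is the number carried by the LAST write access to dm whose
    # chosen component is defined; later commits overwrite earlier ones.
    last = {}
    for op in schedule:
        n = getattr(op, component)
        if n is not None:
            last[op.dm] = n
    # Maximum over the set last(x, beta), or 0 if that set is empty.
    return max(last.values(), default=0)


def current_vn(schedule) -> int:
    return _current(schedule, "version_number")


def current_gn(schedule) -> int:
    return _current(schedule, "generation_number")
```

Note that the maximum is taken over the last qualifying write per DM, not over all writes in the schedule, which is what makes the quantities agree with the DM states in Lemmas 18 and 19.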

Lemma 17 If β is a schedule of B and x is a logical data item in I, then access(x,β)
begins with a CREATE operation for some TM in tm(x) and continues alternately with
REQUEST-COMMIT and CREATE operations for TMs in tm(x) such that each
REQUEST-COMMIT for T is preceded immediately by a CREATE(T) operation.

Proof: By definition, access(x,β) contains only CREATE and REQUEST-COMMIT
operations for TMs in tm(x). By Lemma 16, β is a well-formed schedule, so each
REQUEST-COMMIT for T must be preceded by a CREATE(T) operation. Finally, since β is a serial
schedule, all operations for a given transaction must be contiguous. ■

4 This last condition allows us to consider only those write accesses which change the version number at
a DM.

5 Here, we are only interested in those write accesses which change the generation number at a DM.

Lemma 18 Let x be a logical data item, and let β be a schedule of B. Then the following
property holds after β: The highest version number among the states of all DMs in dm(x)
is current-vn(x,β).

Proof: Since DMs are read-write objects, the only operation that can change the
version-number in the state of a DM O for x is a REQUEST-COMMIT for T operation, where
O(T) = O and T is a write access with data(T).version-number ≠ ⊥. More specifically, the
version-number in the state of a DM O after β is data(T).version-number, where
REQUEST-COMMIT for T is the last such REQUEST-COMMIT in β. In the definition of
current-vn(x,β), the set last(x,β) contains the last write access with version-number ≠ ⊥ for each
DM in dm(x) that has a REQUEST-COMMIT for such a write access in β. Therefore, the
maximum over all T ∈ last(x,β) of data(T).version-number is the highest version number
among the states of all DMs in dm(x) after β. This maximum is exactly the definition of
current-vn(x,β). (If no such REQUEST-COMMITs occur in β, then all DMs have their
initial version-number, which is 0 by definition.) ■

Lemma 19 Let x be a logical data item, and let β be a schedule of B. Then the following
property holds after β: The highest generation number among the states of all DMs in
dm(x) is current-gn(x,β).

Proof:  Analogous  to  that  of  Lemma  18.  ■ 

The main lemma is again proved by induction on the length of β. We take advantage of
the nesting structure in the proof by proving simple assertions about the sub-transactions
of the TMs, and then using these simple assertions to prove the main assertions about
the TMs. As before, the first condition is only used for carrying through the inductive
argument. The important part of the lemma is the second condition, which tells us that
read-TMs return the proper value.

Lemma 20 Let x be a logical data item in I. Let β be a schedule of B such that access(x,β)
is of even length.

1. The following properties hold after β:

(a) For all DMs O ∈ dm(x), if d is the data component of O, and
d.generation-number < current-gn(x,β), then there exists some write-quorum q ∈ d.configuration.w
such that for all DMs O′ ∈ q, if d′ is the data component of O′ then
d′.generation-number > d.generation-number.

(b) For all pairs of DMs O_1, O_2 ∈ dm(x), if d_1 and d_2 are the data components
of O_1 and O_2, then d_1.generation-number = d_2.generation-number implies that
d_1.configuration = d_2.configuration. Let logical-config(x,β) denote the unique
configuration held by all DMs with generation-number = current-gn(x,β).

(c) There exists a write-quorum q ∈ logical-config(x,β).w such that for all DMs O ∈
q, if d is the data component of O, then d.version-number = current-vn(x,β).

(d) For all DMs O ∈ dm(x), if d is the data component of O, then d.version-number
= current-vn(x,β) implies that d.value = logical-state(x,β).

2. If β ends in REQUEST-COMMIT(T,v) with T ∈ tm_r(x), then v = logical-state(x,β).

Proof: By induction on the length of β.

Base case: Let β be the empty schedule. By definition, current-gn(x,β) = 0,
current-vn(x,β) = 0, and logical-state(x,β) = i_x. Initially, all DMs in dm(x) have
generation-number = 0, configuration = config(x), version-number = 0 and value = i_x, by the definition
of a DM. Therefore, logical-config(x,β) = config(x). Furthermore, the states after β of all
the DMs in every q ∈ config(x).w have generation-number = current-gn(x,β), configuration
= config(x), version-number = current-vn(x,β), and value = logical-state(x,β). Thus, part
1 holds. Since β is empty, it does not end in a REQUEST-COMMIT operation of a read-TM
for x. So, part 2 holds vacuously.
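The four properties of part 1 are statements about a snapshot of DM states, so they can be checked mechanically. The Python sketch below is an illustration only: the `Config` and `Data` tuples and the `check_part1` function are names assumed here, and current-gn and current-vn are computed as maxima over the DM states, which Lemmas 18 and 19 justify.

```python
from typing import NamedTuple


class Config(NamedTuple):
    r: frozenset  # read-quorums (frozensets of DM names)
    w: frozenset  # write-quorums (frozensets of DM names)


class Data(NamedTuple):
    version_number: int
    value: object
    generation_number: int
    configuration: Config


def check_part1(dms: dict, logical_state) -> dict:
    """Evaluate (1a)-(1d) on a snapshot mapping DM name -> Data."""
    cur_gn = max(d.generation_number for d in dms.values())
    cur_vn = max(d.version_number for d in dms.values())

    # (1a) every stale DM names a write-quorum whose members are all newer.
    ok_a = all(
        any(all(dms[o].generation_number > d.generation_number for o in q)
            for q in d.configuration.w)
        for d in dms.values() if d.generation_number < cur_gn
    )

    # (1b) equal generation numbers imply equal configurations.
    by_gn = {}
    ok_b = True
    for d in dms.values():
        seen = by_gn.setdefault(d.generation_number, d.configuration)
        ok_b = ok_b and seen == d.configuration
    logical_config = by_gn[cur_gn]

    # (1c) some write-quorum of the logical configuration is fully current.
    ok_c = any(all(dms[o].version_number == cur_vn for o in q)
               for q in logical_config.w)

    # (1d) every DM at the current version number holds the logical state.
    ok_d = all(d.value == logical_state
               for d in dms.values() if d.version_number == cur_vn)
    return {"1a": ok_a, "1b": ok_b, "1c": ok_c, "1d": ok_d}


# Base case: every DM holds generation 0, version 0 and the initial value.
pairs = frozenset(frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C")])
config_x = Config(r=pairs, w=pairs)
initial = {name: Data(0, "i_x", 0, config_x) for name in "ABC"}
base_case = check_part1(initial, "i_x")
```

On the initial snapshot all four checks succeed, matching the base case: (1a) holds vacuously because no DM is stale.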

Induction: Let β = β′τ, where access(x,τ) begins with the last CREATE operation
in access(x,β). Assume that the Lemma holds for β′. By Lemma 17 and the fact that
access(x,β) is of even length, access(x,τ) = (CREATE(T_f), REQUEST-COMMIT(T_f,v_f))
for some T_f ∈ tm(x) and v_f ∈ V_x. We note the following facts about T_f:

Fact 1: All accesses in τ to DMs in dm(x) are descendants of T_f.

Proof: Since β is a serial schedule, T_f is the only TM in tm(x) whose descendants
have operations in τ. Furthermore, the system type of B is constrained so that all
accesses to DMs in dm(x) are descendants of TMs in tm(x).

Since T_f requests to commit in τ, we know by definition of T_f that at least one
read-coordinator, a child of T_f, must commit to T_f in τ. Let T’ be the first read-coordinator
that commits to T_f in τ, and let τ′ be the portion of τ up to and including the COMMIT
for T’.

Fact 2: If s is the state of T’ just after a read access commits to T’ in τ′, then

1. data(s).generation-number and data(s).configuration contain the highest
generation-number and associated configuration among DMs in read(s), and

2. data(s).version-number and data(s).value contain the highest version-number
and associated value among DMs in read(s).

Proof: This fact holds because T’ retains the maximum generation-number and
version-number (seen so far) and their respective configuration and value upon each
commit of a read access. Since T’ is the first child that commits to T_f and since T’
invokes only read accesses, the data components of all DMs observed by T’ must be
the same during τ′ as after β′.

Fact 3: The data component of the state of T_f forever after is (current-vn(x,β′),
logical-state(x,β′), current-gn(x,β′), logical-config(x,β′)).

Proof: Let s’ be the state of T’ when T’ issues its REQUEST-COMMIT
operation. Together, part 1 of Fact 2 and part (1a) of the induction hypothesis imply
that read(s’) cannot contain a read-quorum according to data(s’).configuration unless
data(s’).generation-number = current-gn(x,β′). By definition, T’ cannot commit
unless read(s’) contains a read-quorum in data(s’).configuration.r. Therefore, by part
(1b) of the induction hypothesis, we know that read(s’) contains some read-quorum
r in data(s’).configuration.r = logical-config(x,β′).r. By part (1c) of the induction
hypothesis, we know that there exists some write-quorum w ∈ logical-config(x,β′).w such
that the states after β′ of all DMs in w have version-number = current-vn(x,β′). Since
logical-config(x,β′) is a legal configuration, r and w must have a non-empty
intersection. So, by part 2 of Fact 2, data(s’).version-number = current-vn(x,β′). Therefore,
by part (1d) of the induction hypothesis, data(s’).value = logical-state(x,β′).

When T’ commits, the data component of the state of T_f becomes data(s’). By
definition, once T’ commits, the data component of the state of T_f never changes.
Therefore, Fact 3 is proved.
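The intersection step in the proof of Fact 3 can be illustrated with a few lines of Python. This is a sketch under assumptions: quorums are frozensets of DM names, `is_legal` encodes legality as "every read-quorum meets every write-quorum", and the version numbers in `versions` are made-up sample data.

```python
from itertools import combinations


def is_legal(read_quorums, write_quorums) -> bool:
    # Legal: every read-quorum shares at least one DM with every write-quorum.
    return all(r & w for r in read_quorums for w in write_quorums)


def highest_version_seen(versions: dict, read_quorum) -> int:
    # What a read-coordinator retains after reading every DM in read_quorum.
    return max(versions[dm] for dm in read_quorum)


dms = ["A", "B", "C"]
majorities = [frozenset(c) for c in combinations(dms, 2)]

# Sample state: the write-quorum {A, B} holds the current version number 4,
# while C is one version behind.
versions = {"A": 4, "B": 4, "C": 3}

# With a legal configuration, any read-quorum observes the current version.
seen = [highest_version_seen(versions, r) for r in majorities]

# Without the intersection property a reader may miss it entirely.
illegal_reads = [frozenset({"C"})]
illegal_writes = [frozenset({"A", "B"})]
stale = highest_version_seen(versions, illegal_reads[0])
```

In the majority configuration every entry of `seen` equals the current version number, whereas the reader confined to {C} returns a stale one; this is exactly the situation that legality rules out.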

From Fact 1, we know that all accesses to DMs for x in τ are descendants of T_f. Therefore,
in order to prove that the induction hypothesis holds for β, we merely need to demonstrate
that T_f preserves the properties stated. There are three possibilities for T_f:

• If T_f is a read-TM, then T_f invokes only read-coordinators, which invoke only read
accesses. So, current-vn(x,β) = current-vn(x,β′) and current-gn(x,β) =
current-gn(x,β′). Furthermore, the data components of the states of the DMs are the same
after β as after β′. Therefore, part 1 of the Lemma holds for β. By definition, T_f
cannot request to commit until at least one of its read-coordinators commits. Since
T’ is the first committing read-coordinator, the REQUEST-COMMIT for T_f must
occur at some point after β′τ′. When T_f commits, it returns the value in the data
component of its state. By Fact 3, this value is logical-state(x,β′). Since T_f is a
read-TM, logical-state(x,β) = logical-state(x,β′) by definition. Thus, part 2 of the
Lemma holds for β.

• If T_f is a write-TM, then we note the following facts:

Fact 4: All write-coordinators T for x invoked in τ have version-number(T)
= current-vn(x,β′) + 1, value(T) = value(T_f), and configuration(T) =
logical-config(x,β′).

Proof: Let s be the state of T_f when it issues REQUEST-CREATE(T). Then
by definition of a write-TM, version-number(T) = data(s).version-number + 1,
value(T) = value(T_f), and configuration(T) = data(s).configuration. By
definition, T_f cannot invoke a write-coordinator until at least one of its
read-coordinators commits. So, all REQUEST-CREATEs for write-coordinators in τ
occur after β′τ′. Therefore, by Fact 3, data(s).version-number = current-vn(x,β′)
and data(s).configuration = logical-config(x,β′). Thus, Fact 4 holds.

Fact 5: If T is a write access for x invoked in τ, then data(T) = (current-vn(x,β),
logical-state(x,β), ⊥, ⊥), and current-vn(x,β) > current-vn(x,β′).

Proof: The type of system B is constrained so that T is invoked by some
write-coordinator for x. Therefore, by Fact 4 and the definition of a
write-coordinator, data(T) = (current-vn(x,β′) + 1, value(T_f), ⊥, ⊥). Therefore, since
current-vn(x,β′) + 1 is the highest (only) version-number for x written in τ, it
follows from Lemma 18 and the definition of current-vn that current-vn(x,β)
= current-vn(x,β′) + 1. Since T_f is a write-TM, logical-state(x,β) = value(T_f).
Thus, Fact 5 is proved.

By Fact 5, the generation-numbers and configurations in the states of DMs for x
are not changed during τ, and current-gn(x,β) = current-gn(x,β′). Therefore, parts
(1a) and (1b) of the Lemma hold after β. (Note that logical-config(x,β′) =
logical-config(x,β).)

By definition, T_f cannot request to commit until at least one of its write-coordinators
commits. Let T_w be the first write-coordinator that commits to T_f, and let τ″ be the
portion of τ up to and including the COMMIT of T_w. By definition, T_w cannot
request to commit until it has received COMMITs for write accesses to a write-quorum
of DMs in configuration(T_w). By Fact 4, configuration(T_w) = logical-config(x,β′),
which equals logical-config(x,β). Therefore, by Fact 5, parts (1c) and (1d) of the
Lemma hold after β′τ″.

We now show that part 1 of the Lemma still holds after β′τ. By Fact 5, any
write-coordinators that may execute in τ after τ″ merely propagate the new value and
version number. Any read-coordinators that may execute in τ after τ″ cannot change
the values at the DMs, since they do not invoke write accesses. Therefore, part 1 of
the Lemma holds after β′τ = β.

Since T_f is not a read-TM, part 2 holds vacuously.

• If T_f is a reconfigure-TM, then we note the following facts:

Fact 6: All write-coordinators T_w for x invoked in τ have version-number(T_w)
= current-vn(x,β′), value(T_w) = logical-state(x,β′), and configuration(T_w) =
logical-config(x,β′). Furthermore, all reconfigure-coordinators T_rec for x invoked
in τ have generation-number(T_rec) = current-gn(x,β′) + 1, old-configuration(T_rec)
= logical-config(x,β′), and new-configuration(T_rec) = target-configuration(T_f).

Proof: Analogous to that of Fact 4.

Fact 7: If T is a write access invoked by a write-coordinator for x in τ, then
data(T) = (current-vn(x,β), logical-state(x,β), ⊥, ⊥).

Proof: By Fact 6 and the definition of a write-coordinator, data(T) =
(current-vn(x,β′), logical-state(x,β′), ⊥, ⊥). Since these are the only write accesses in τ
that modify the version-number components in the states of DMs for x, we know
by Lemma 18 and the definition of current-vn that current-vn(x,β) =
current-vn(x,β′). Since T_f is a reconfigure-TM, logical-state(x,β) = logical-state(x,β′)
by definition.

Fact 8: If T is a write access invoked by a reconfigure-coordinator for x in τ,
then data(T) = (⊥, ⊥, current-gn(x,β), logical-config(x,β)), and current-gn(x,β)
> current-gn(x,β′).

Proof: By Fact 6 and the definition of a reconfigure-coordinator, data(T) =
(⊥, ⊥, current-gn(x,β′) + 1, target-configuration(T_f)). Therefore, since
current-gn(x,β′) + 1 is the highest (only) generation-number for x written in τ, it follows
from Lemma 19 and the definition of current-gn that current-gn(x,β) =
current-gn(x,β′) + 1. Also, target-configuration(T_f) is the configuration associated with
current-gn(x,β), which is logical-config(x,β) by definition.

By definition, T_f cannot request to commit until at least one of its write-coordinators
commits. Let T_w be the first write-coordinator that commits to T_f, and let τ'' be the
portion of τ up to and including the COMMIT of T_w. We claim that part 1 of the
induction hypothesis holds after β'τ''. There are two cases:


1. If τ'' does not contain a COMMIT of a reconfigure-coordinator, then by Fact 7,
any write accesses invoked in τ'' simply propagate the current value and version
number, so part 1 still holds.

2. If τ'' does contain one or more COMMITs of reconfigure-coordinators, then each
reconfigure-coordinator T_rec cannot commit until it has received COMMIT operations
for write accesses to a write-quorum w of DMs in old-configuration(T_rec).
We now show that part (1a) of the lemma holds after β'τ''. There are two
classes of DMs to consider: (1) All DMs that have generation-number =
current-gn(x, β') after β'τ'' must have configuration = logical-config(x, β') by Fact 8 and
part (1a) of the induction hypothesis. By Fact 6, old-configuration(T_rec) =
logical-config(x, β'). Therefore, w ∈ logical-config(x, β').w, so part (1a) holds. (2) For
all DMs that have generation-number < current-gn(x, β'), we know by Fact 8
and Lemma 19 that no DM's generation-number could have been decreased in
τ''. So, by part (1a) of the induction hypothesis, part (1a) still holds.

By Fact 8 and Lemma 19, we know that all write accesses for x in τ'' with
generation-number ≠ ⊥ have a generation-number greater than (and therefore
different from) any generation-number for x in β'. Furthermore, since all such write
accesses have the same generation-number and configuration, we know by part
(1b) of the induction hypothesis that part (1b) still holds. By definition, T_w
cannot request to commit until it has received COMMIT operations for write
accesses to a write quorum of DMs in configuration(T_w). Therefore, by Fact 7,
parts (1c) and (1d) hold after β'τ''.

Thus, the claim is true in both cases, so part 1 of the lemma holds after β'τ''.

By Fact 7, any write-coordinators that may execute in τ after τ'' merely propagate
the new value and version number, so they preserve part 1 of the induction hypothesis.
Similarly, by Fact 8, any reconfigure-coordinators that may execute in τ after τ''
merely propagate the new configuration and generation-number. And certainly any
read-coordinators that may execute in τ after τ'' cannot change the data components
of the DMs. Therefore, part 1 of the induction hypothesis holds when T_f commits.

Since T_f is not a read-TM, part 2 holds vacuously.


For all three possibilities of T_f, the lemma holds after β.

4.2 Non-replicated Serial System


We define non-replicated serial system A of type (T_A, parent_A, O_A, V_A) in terms of replicated
serial system B of type (T_B, parent_B, O_B, V_B). The transactions that were read-TMs and
write-TMs for objects in I in system B become accesses in system A, and the collection of
DMs for each object in I in system B is replaced by a single read-write object in system A.
The reconfigure-TMs, coordinators and accesses from system B are not present in system
A. More formally, the system type is:

• $T_A = T_B - \bigl(\bigcup_{x \in I} acc(x)\bigr) - \bigl(\bigcup_{x \in I} co(x)\bigr) - \bigl(\bigcup_{x \in I} tm_{rec}(x)\bigr)$

• $parent_A = parent_B$ restricted to $T_A$

• $O_A = O_B - \bigl(\bigcup_{x \in I} dm(x)\bigr) \cup \{\, tm_r(x) \cup tm_w(x) \cup tm_{rec}(x) \mid x \in I \,\}$

• $V_A = V_B$


Informally, to construct the type of system A from that of system B, we first remove
from T all the coordinators, all the reconfigure-TMs, and all the accesses to the DMs for
objects in I. As a result, all the TMs for objects in I become leaves in T and are therefore
accesses. Next, we remove from O all the DMs for objects in I. (Effectively, this has already
been done by removing the corresponding accesses.) Finally, we partition all the accesses
that were formerly TMs according to their logical data item. Each class of this partition is
a new object in O. Thus, each logical data item is implemented by a single object.
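
The informal construction above can be sketched in code. The following is only an
illustration under stated assumptions: the dictionary encoding of a system type, the
"kind" labels, the function name derive_type_a and the object name "O(x)" are all
hypothetical and do not come from the formal model. The sketch follows the informal
description, so only the former read-TMs and write-TMs are grouped into new objects.

```python
# Hypothetical encoding (not from the text): every transaction carries a "kind" label.
REMOVED_KINDS = {"acc", "co", "tm_rec"}  # replica accesses, coordinators, reconfigure-TMs


def derive_type_a(parent_b, objects_b, kind, item, dms):
    """Return (parent_a, objects_a) for system A, given the type of system B.

    parent_b:  dict mapping each transaction to its parent (the root maps to None)
    objects_b: dict mapping each object name to the set of its accesses
    kind:      dict mapping each transaction to one of
               "user", "access", "tm_r", "tm_w", "tm_rec", "co", "acc"
    item:      dict mapping each read-TM or write-TM to its logical data item
    dms:       set of object names that are DMs for replicated items
    """
    # Step 1: remove coordinators, reconfigure-TMs, and accesses to the DMs.
    kept = {t for t in parent_b if kind[t] not in REMOVED_KINDS}
    parent_a = {t: parent_b[t] for t in kept}
    # Every surviving transaction must still have a surviving parent (or be the root).
    assert all(p is None or p in kept for p in parent_a.values())

    # Step 2: remove the DMs from the object set.
    objects_a = {o: set(acc) for o, acc in objects_b.items() if o not in dms}

    # Step 3: partition the former read-/write-TMs by logical data item;
    # each class becomes one new object, named "O(x)" here for illustration.
    for t in kept:
        if kind[t] in ("tm_r", "tm_w"):
            objects_a.setdefault(f"O({item[t]})", set()).add(t)
    return parent_a, objects_a


# A tiny system B: user U with a read-TM R, a write-TM W, a reconfigure-TM C
# (each with one coordinator and one replica access) and a non-replica access n.
parent_b = {"U": None, "R": "U", "W": "U", "C": "U", "n": "U",
            "cr": "R", "cw": "W", "cc": "C",
            "a1": "cr", "a2": "cw", "a3": "cc"}
kind = {"U": "user", "n": "access", "R": "tm_r", "W": "tm_w", "C": "tm_rec",
        "cr": "co", "cw": "co", "cc": "co", "a1": "acc", "a2": "acc", "a3": "acc"}
objects_b = {"dm1": {"a1", "a2", "a3"}, "Y": {"n"}}

parent_a, objects_a = derive_type_a(
    parent_b, objects_b, kind, item={"R": "x", "W": "x"}, dms={"dm1"})
```

In this example the tree of system A keeps only U, R, W and n, and the single DM is
replaced by one object whose accesses are R and W.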

Figure 4.2 illustrates the transaction tree for system A that corresponds to the
transaction tree for system B given in Figure 4.1.


The following lemma tells us that the function T_BA is well-defined. This allows us
to relate transactions in system B to those in system A.


Lemma 21 System B is an extension of system A.

Figure 4.2: The transaction tree for system A that corresponds to the transaction tree for
B shown in Figure 4.1. Transactions are labeled as follows:
U = user transaction; a, b, x, y = accesses.

Proof: Since $T_A = T_B - \bigl(\bigcup_{x \in I} acc(x)\bigr) - \bigl(\bigcup_{x \in I} co(x)\bigr) - \bigl(\bigcup_{x \in I} tm_{rec}(x)\bigr)$,
we know that $T_A \subseteq T_B$. Furthermore, T_A and T_B must have the same root, unless the root of T_B is in
acc(x), co(x), or tm_rec(x) for some x ∈ I. However, every member of acc(x) is a child of
some member of co(x), which in turn is a child of some member of tm(x). In addition,
every member of tm_rec(x) is a child of some user transaction. So none of the transactions
in acc(x), co(x), or tm_rec(x) for any x ∈ I could be the root of T_B. ■

We define user transactions in system A to be all non-access transactions in T_A. Just
as for the fixed configuration systems, we note that T is a user transaction in system B
iff T_BA(T) is a user transaction in system A. Transactions and objects in system A are
modelled in the same way as in system B, except that for all x ∈ I,

1. the object corresponding to tm(x) has an associated read-write object O over domain
V_x (we refer to this particular read-write object as O(x)),

2. for each transaction T ∈ tm(x) where T is a read-TM or write-TM, T_BA(T) is an access
to O(x) such that

(a) if T is a read-TM, then T_BA(T) is a read-access, and

(b) if T is a write-TM, then T_BA(T) is a write-access with data(T_BA(T)) = value(T).

Furthermore,  if  T  is  a  user  transaction,  then  it  is  modelled  by  the  same  user  automaton  as 
in  system  B,  but  without  an  associated  spy  automaton. 
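
To make the role of O(x) concrete, here is a minimal sketch of a read-write object and
of the mapping from TMs to accesses. It is an illustration only: the class
ReadWriteObject, the helpers to_access and logical_state, the tuple encoding of TMs and
the choice of None as the result of a write are assumptions, and aborted transactions
are ignored for simplicity.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ReadWriteObject:
    """Hypothetical model of O(x): the state is just the current value."""
    value: Any  # starts at the initial value of the logical data item

    def access(self, kind: str, data: Any = None) -> Any:
        if kind == "write":
            self.value = data      # a write-access stores its data
            return None            # assumed "no result" marker for writes
        if kind == "read":
            return self.value      # a read-access returns the current value
        raise ValueError(f"unknown access kind: {kind}")


def to_access(tm):
    """Map a TM of system B, encoded as (kind, value), to an access of system A."""
    kind, value = tm
    return ("read", None) if kind == "read-TM" else ("write", value)


def logical_state(tms, initial):
    """Value of the last write-TM in the sequence, or the initial value if none."""
    state = initial
    for kind, value in tms:
        if kind == "write-TM":
            state = value
    return state


# Replaying the mapped accesses leaves O(x) holding the logical state.
tms = [("write-TM", 5), ("read-TM", None), ("write-TM", 7)]
obj = ReadWriteObject(value=0)
results = [obj.access(*to_access(tm)) for tm in tms]
```

After the replay, the state of the object equals the value of the last write-TM, which
is the correspondence that the correctness argument relies on.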

4.3  Correctness 

In  this  section,  we  show  that  user  transactions  cannot  distinguish  between  replicated  serial 
system B and non-replicated serial system A. The proof is analogous to that of Theorem
10,  but  this  time  making  use  of  Lemma  20.  We  are  only  interested  in  the  correspondence 
between  the  schedules  of  the  user  automata  in  systems  A  and  B.  So,  in  Condition  2  of  the 
theorem,  we  only  require  that  the  user  automata,  rather  than  the  user  transactions,  have 
the  same  schedules. 

Theorem 22 Let β be a schedule of replicated serial system B. There exists a schedule α
of non-replicated serial system A such that the following two conditions hold.

1. For all objects O in system B that are not in dm(x) for any x, α|O = β|O.

2. For all user transactions T in system B, α|T_BA(T) = β|T_user, where T_user is the user
automaton associated with T.

Proof: We construct α by removing from β all the REQUEST-CREATE(T),
CREATE(T), REQUEST-COMMIT(T,v), COMMIT(T,v), and ABORT(T) operations for all
transactions T in acc(x), co(x), and tm_rec(x), for all x ∈ I. Clearly, the two conditions
hold. What needs to be proved is that α is a schedule of A. We proceed by induction on
the length of β.
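
The construction of the schedule for A amounts to a filter over the schedule for B. The
sketch below is illustrative only: the tuple encoding of operations, the "kind" labels
and the function name project_to_a are assumptions, not part of the formal model.

```python
# Hypothetical labels for transactions that exist only in system B.
REMOVED_KINDS = {"acc", "co", "tm_rec"}  # replica accesses, coordinators, reconfigure-TMs


def project_to_a(beta, kind):
    """Drop every operation whose transaction is removed when passing from B to A.

    beta: list of operations, each a tuple (name, transaction) or
          (name, transaction, value)
    kind: dict mapping each transaction to its kind label
    """
    return [op for op in beta if kind[op[1]] not in REMOVED_KINDS]


# A write-TM W runs one coordinator cw, which runs one replica access a1.
kind = {"U": "user", "W": "tm_w", "cw": "co", "a1": "acc"}
beta = [
    ("CREATE", "U"),
    ("REQUEST-CREATE", "W"), ("CREATE", "W"),
    ("REQUEST-CREATE", "cw"), ("CREATE", "cw"),
    ("REQUEST-CREATE", "a1"), ("CREATE", "a1"),
    ("REQUEST-COMMIT", "a1", "ok"), ("COMMIT", "a1", "ok"),
    ("REQUEST-COMMIT", "cw", "ok"), ("COMMIT", "cw", "ok"),
    ("REQUEST-COMMIT", "W", "ok"), ("COMMIT", "W", "ok"),
]
alpha = project_to_a(beta, kind)
```

In the projected sequence W is created and requests to commit without invoking any
children, which is the behaviour of an access in system A.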

Base case: Suppose β is empty. Then α is also empty and is therefore a schedule of A.
Induction: Let β = β'π_β, where the claim holds for β'. Let α = α'π_α, where α' is the
schedule of A corresponding to β'. There are five cases for π_β.

read-write object, v' is the value in the state of O(x) after α'. We observe that, by
the construction, α'|O(x) = access(x, β'). So, by definition of system A, the last write
access in α' to O(x) has the same value as the last write-TM in β'. Hence, the value
in the state of O(x) after α' is logical-state(x, β'). Therefore, v = v'. (If there is no
write-TM in access(x, β'), then there are no write accesses to O(x) in α'. In this case,
the value in the state of O(x) after α' is i_x, which is logical-state(x, β').)

5. Output Operations of the Scheduler (except those already covered by Case 1):
If π_β is a CREATE(T), a COMMIT for T, or an ABORT(T), where T is a user
transaction, T is a non-replica access, or T is in tm_r(x) or tm_w(x) for some x ∈ I,
then by the construction π_α = π_β.

If π_β is a CREATE(T) or ABORT(T), then the preconditions for π_α are (1) there
must be a REQUEST-CREATE(T) in α' but no CREATE(T) or ABORT(T) in α',
and (2) all siblings of T_BA(T) with creates in α' must have returned (committed
or aborted) in α'. Since β is a well-formed schedule, REQUEST-CREATE(T) is in
β', and, by the construction, is in α' as well. Similarly, since no CREATE(T) or
ABORT(T) can occur in β', none can occur in α' either. Therefore, precondition (1)
is satisfied. By the construction, all commits and aborts in α' of siblings of T_BA(T)
must also appear in β'. So, since β is well-formed, precondition (2) must also be
satisfied.

If π_β is a COMMIT(T,v), then the preconditions for π_α are (1) a
REQUEST-COMMIT(T,v) must occur in α', (2) T_BA(T) cannot have a COMMIT or ABORT
in α', and (3) any children invoked by T_BA(T) must have returned in α'. Using the
same argument as above, by the construction and the fact that β is well-formed,
preconditions (1) and (2) must be satisfied. If T is a non-replica access or T is in
tm_r(x) or tm_w(x) for some x ∈ I, then T_BA(T) cannot have any children in A. If T
is a user transaction, then all return operations of the children of T in β' are, by the
construction, included in α'. Therefore, since β is well-formed, precondition (3) must
be satisfied.


In all cases, α is a schedule of A.


4.4  Concurrent  Replicated  Systems 

Just  as  for  the  fixed  configuration  algorithm,  we  now  complete  the  correctness  argument 
by  showing  that  non-serial  replicated  systems  are  correct.  We  proved  in  the  simulation 
argument  of  Theorem  22  that  in  every  schedule  of  system  B,  the  user  automata  have 
the  same  schedules  as  their  corresponding  transactions  in  some  schedule  of  system  A.  In 
particular,  we  proved  a  property  only  of  the  user  automata,  not  of  the  user  transactions 
(which  include  the  spies).  Since  it  is  the  user  automata  alone  that  model  the  users  of  the 
system,  this  property  is  all  that  was  required  for  correctness.  (In  fact,  it  would  have  made 
no sense to include the spies, since their output operations are not even defined in system A.)

For  the  correctness  of  non-serial  systems,  we  continue  this  pattern  and  again  consider 
only  the  user  automata.  To  this  end,  we  introduce  the  following  operator,  which  removes, 
from  any  sequence  of  operations,  those  operations  of  the  spy  automata  that  are  not  also 
operations  of  the  user  automata. 

Let C be any system having the same type as system B, and let γ be a sequence of
operations of system C. Then hide(γ) denotes the subsequence of γ containing all operations
except the REQUEST-CREATE(T), COMMIT for T, and ABORT(T) operations with
T ∈ tm_rec.
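
As a small illustration, the hide operator can be written as a filter. This is a sketch
under assumptions: operations are encoded as tuples whose first two fields are the
operation name and the transaction, and the set of reconfigure-TMs is passed in
explicitly. Neither encoding comes from the formal model.

```python
# Operation names removed by hide when the transaction is a reconfigure-TM.
HIDDEN_NAMES = {"REQUEST-CREATE", "COMMIT", "ABORT"}


def hide(gamma, tm_rec):
    """Return the subsequence of gamma without the REQUEST-CREATE, COMMIT and
    ABORT operations of transactions in tm_rec; everything else is kept in order."""
    return [op for op in gamma
            if not (op[0] in HIDDEN_NAMES and op[1] in tm_rec)]


# C is a reconfigure-TM; U is a user transaction and R is a read-TM.
gamma = [
    ("CREATE", "U"),
    ("REQUEST-CREATE", "C"),
    ("CREATE", "C"),
    ("REQUEST-COMMIT", "C", "ok"),
    ("COMMIT", "C", "ok"),
    ("REQUEST-CREATE", "R"),
    ("REQUEST-COMMIT", "U", "done"),
]
visible = hide(gamma, tm_rec={"C"})
```

Note that, following the definition as stated, CREATE and REQUEST-COMMIT operations of
a reconfigure-TM are not removed by this filter; only the three listed operation kinds
are.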

With  this  definition,  we  can  now  state  the  final  theorem. 

Theorem 23 Let C be any system that has the same type as system B, and let the set of
user transaction automata in C be the same as in B. Assume that all schedules γ of C are
serially correct with respect to serial system B for all non-orphan non-access transactions.
Then for all schedules γ of C, hide(γ) is serially correct with respect to system A for all
non-orphan user transactions.

Proof: An immediate consequence of Theorem 22, and the fact that the hide operator
removes from γ exactly those operations of the user transaction automata (composition of
user automata and spies) in system C that are not operations of the user automata (without
the spies). ■

So, any concurrency control algorithm that provides serializability at the level of the
copies may be combined with the Reconfigurable Quorum Consensus replica management
algorithm to produce a correct system. The discussion in Section 3.4 of such concurrency
control algorithms applies here as well.


Chapter  5 

Conclusion 


We  have  presented  a  precise  description  and  rigorous  correctness  proof  for  a  generalization  of 
Gifford’s  data  replication  algorithm  that  accommodates  nested  transactions  and  transaction 
failures.  The  algorithm  was  described  using  the  new  Lynch-Merritt  input-output  automaton 
model  for  nested  transaction  systems,  and  the  correctness  proof  was  constructed  directly 
from  this  description. 

The  algorithm  was  decomposed  into  simple  modules  that  were  arranged  naturally  in  a 
tree  structure.  This  use  of  nesting  as  a  modelling  tool  enabled  us  to  use  standard  assertional 
techniques  to  prove  properties  of  transactions  based  upon  the  properties  of  their  children. 

Each module was described in terms of an automaton that made extensive use of
nondeterminism. Although one would not actually implement a system in this way, the
nondeterminism permitted us to construct a correctness proof that was independent of any particular
programming  language  or  implementation.  In  other  words,  the  nondeterministic  automata 
describe  the  basic  requirements  of  the  algorithm,  and  our  proof  implies  the  correctness  of 
any  specific  implementation  that  meets  these  requirements. 

The  modularity  of  the  proof  strategy  permitted  us  to  separate  the  concerns  of  replication 
from  those  of  concurrency  control  and  recovery.  Our  arguments  were  simple,  in  part,  because 
of  this  separation.  That  is,  we  were  able  to  deal  exclusively  with  serial  systems  in  order 
to simplify our reasoning. Then, to complete the proof, we presented a simple theorem
stating that combining any correct concurrency control algorithm with our replication
algorithm yields a correct system.

This work has identified a general framework for proving the correctness of data
replication algorithms in nested transaction systems. One begins by constructing a formal
description of the algorithm in terms of a nested transaction system built from I/O automata.
Then,  one  uses  the  appropriate  definitions  to  show  that  each  logical  read  access  returns 
the  proper  value.  Next,  one  constructs  a  corresponding  serial  system  without  replication, 
and  proves  that  the  user  transactions  in  that  system  have  the  same  executions  as  the  user 
automata  in  the  replicated  system.  Finally,  one  proves  separately  the  correctness  of  the 
concurrency  control  algorithm,  and  applies  a  result  analogous  to  Theorem  23  to  show  that 
the  combined  system  is  correct. 

One  possible  direction  for  further  work  involves  using  this  general  technique  to  add 
transaction  nesting  to  other,  more  complicated,  data  replication  schemes,  and  prove  the 
resulting  algorithms  correct.  Some  interesting  examples  include  the  “Virtual  Partition” 
approach of Abbadi and Toueg [AT], and Herlihy's "General Quorum Consensus" [He].

Some replication algorithms guarantee weaker correctness conditions than the one
presented here for Gifford's algorithm. It would be interesting to see what impact these weaker
correctness conditions would have on the proof structure that we have presented.